Chapter 7 (final)

Nateshwar Kamlesh
Trainer at BITS Bhopal

AI notes

Concept of Learning
Rote Learning
Learning by Taking Advice
Learning in Problem Solving
Learning by Induction
Explanation-Based Learning
Learning Automation
Learning in Neural Networks



Unit 7
Learning

  Learning Objectives
  After reading this unit you should appreciate the following:

  • Concept of Learning

  • Learning Automation

  • Genetic Algorithm

  • Learning by Induction

  • Neural Networks

  • Learning in Neural Networks

  • Back Propagation Network

Concept of Learning

One of the most often heard criticisms of AI is that machines cannot be called intelligent until they
are able to learn to do new things and to adapt to new situations, rather than simply doing as they
are told to do. There can be little question that the ability to adapt to new surroundings and to
solve new problems is an important characteristic of intelligent entities. Can we expect to see
such abilities in programs? Ada Augusta, one of the earliest philosophers of computing, wrote that

The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we
know how to order it to perform.

Several AI critics have interpreted this remark as saying that computers cannot learn. In fact, it
does not say that at all. Nothing prevents us from telling a computer how to interpret its inputs in
such a way that its performance gradually improves.

Rather than asking in advance whether it is possible for computers to "learn," it is much more
enlightening to try to describe exactly what activities we mean when we say "learning" and what
mechanisms could be used to enable us to perform those activities. Simon has proposed that
learning denotes changes in the system that are adaptive in the sense that they enable the
system to do the same task or tasks drawn from the same population more efficiently and more
effectively the next time.

As thus defined, learning covers a wide range of phenomena. At one end of the spectrum is skill
refinement. People get better at many tasks simply by practicing. The more you ride a bicycle or
play tennis, the better you get. At the other end of the spectrum lies knowledge acquisition. As we
have seen, many AI programs draw heavily on knowledge as their source of power. Knowledge is
generally acquired through experience and such acquisition is the focus of this chapter.

Knowledge acquisition itself includes many different activities. Simple storing of computed
information, or rote learning, is the most basic learning activity. Many computer programs, e.g.,
database systems, can be said to "learn" in this sense, although most people would not call such
simple storage learning. However, many AI programs are able to improve their performance
substantially through rote-learning techniques, and we will look at one example in depth, the
checker-playing program of Samuel.

Another way we learn is through taking advice from others. Advice taking is similar to rote
learning, but high-level advice may not be in a form simple enough for a program to use directly in
problem solving. The advice may need to be first operationalized.

People also learn through their own problem-solving experience. After solving a complex
problem, we remember the structure of the problem and the methods we used to solve it. The
next time we see the problem, we can solve it more efficiently. Moreover, we can generalize from
our experience to solve related problems more easily. In contrast to advice taking, learning from
problem-solving experience does not usually involve gathering new knowledge that was
previously unavailable to the learning program. That is, the program remembers its experiences
and generalizes from them, but does not add to the transitive closure of its knowledge, in the
sense that an advice-taking program would, i.e., by receiving stimuli from the outside world. In
large problem spaces, however, efficiency gains are critical. Practically speaking, learning can
mean the difference between solving a problem rapidly and not solving it at all. In addition,
programs that learn through problem-solving experience may be able to come up with qualitatively
better solutions in the future.

Another form of learning that does involve stimuli from the outside is learning from examples. We
often learn to classify things in the world without being given explicit rules. For example, adults
can differentiate between cats and dogs, but small children often cannot. Somewhere along the
line, we induce a method for telling cats from dogs - based on seeing numerous examples of
each. Learning from examples usually involves a teacher who helps us classify things by
correcting us when we are wrong. Sometimes, however, a program can discover things without
the aid of a teacher.

AI researchers have proposed many mechanisms for doing the kinds of learning described
above. In this chapter, we discuss several of them. But keep in mind throughout this discussion
that learning is itself a problem-solving process. In fact, it is very difficult to formulate a precise
definition of learning that distinguishes it from other problem-solving tasks.


Rote Learning

When a computer stores a piece of data, it is performing a rudimentary form of learning. After all,
this act of storage presumably allows the program to perform better in the future (otherwise, why
bother?). In the case of data caching, we store computed values so that we do not have to
recompute them later. When computation is more expensive than recall, this strategy can save a
significant amount of time. Caching has been used in AI programs to produce some surprising
performance improvements. Such caching is known as rote learning.

                                Figure 7.1: Storing Backed-Up Values


We mentioned one of the earliest game-playing programs, Samuel's checkers program. This
program learned to play checkers well enough to beat its creator. It exploited two kinds of
learning: rote learning and parameter (or coefficient) adjustment. Samuel's program used the
minimax search procedure to explore checkers game trees. As is the case with all such
programs, time constraints permitted it to search only a few levels in the tree. (The exact number
varied depending on the situation.) When it could search no deeper, it applied its static evaluation
function to the board position and used that score to continue its search of the game tree. When it
finished searching the tree and propagating the values backward, it had a score for the position
represented by the root of the tree. It could then choose the best move and make it. But it also
recorded the board position at the root of the tree and the backed up score that had just been
computed for it as shown in Figure 7.1(a).

Now consider a situation as shown in Figure 7.1(b). Instead of using the static evaluation function
to compute a score for position A, the stored value for A can be used. This creates the effect of
having searched several additional ply, since the stored value for A was computed by backing
up values from exactly such a search.
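
The effect of a stored backed-up value can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Samuel's actual program: the toy game tree TREE, the static scores STATIC, and the helper names are all hypothetical, and the stored table is keyed by position alone, ignoring which side is to move.

```python
# Hypothetical toy game tree and static scores, for illustration only.
TREE = {"root": ["A", "B"], "A": ["A1", "A2"]}
STATIC = {"root": 0, "A": 1, "B": 3, "A1": 5, "A2": 8}


def successors(position):
    """Positions reachable in one move (empty for leaves)."""
    return TREE.get(position, [])


def static_eval(position):
    """Stand-in for a static evaluation function."""
    return STATIC[position]


def minimax(position, depth, maximizing, stored):
    """Depth-limited minimax that prefers rote-learned scores at the frontier."""
    children = successors(position)
    if depth == 0 or not children:
        # A stored backed-up value replaces the static evaluation if available.
        if position in stored:
            return stored[position]
        return static_eval(position)
    scores = [minimax(child, depth - 1, not maximizing, stored)
              for child in children]
    return max(scores) if maximizing else min(scores)


def search_and_remember(position, depth, maximizing, stored):
    """Search from a position and record its backed-up score (rote learning)."""
    score = minimax(position, depth, maximizing, stored)
    stored[position] = score
    return score


stored = {}
shallow = minimax("root", 1, True, stored)      # static values only
search_and_remember("A", 1, False, stored)      # an earlier search rooted at A
improved = minimax("root", 1, True, stored)     # same depth, now uses stored A
deeper = minimax("root", 2, True, {})           # a genuinely deeper search
```

In this toy tree the one-ply search that uses the stored value for A returns the same score as a two-ply search without it, which is the effect the text describes.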

Rote learning of this sort is very simple. It does not appear to involve any sophisticated problem-
solving capabilities. But even it shows the need for some capabilities that will become
increasingly important in more complex learning systems. These capabilities include:

1.    Organized Storage of Information: In order for it to be faster to use a stored value than it
      would be to recompute it, there must be a way to access the appropriate stored value quickly. In
      Samuel's program, this was done by indexing board positions by a few important
      characteristics, such as the number of pieces. But as the complexity of the stored
      information increases, more sophisticated techniques are necessary.

2.    Generalization: The number of distinct objects that might potentially be stored can be very
      large. To keep the number of stored objects down to a manageable level, some kind of
      generalization is necessary. In Samuel's program, for example, the number of distinct
      objects that could be stored was equal to the number of different board positions that can
      arise in a game. Only a few simple forms of generalization were used in Samuel's program
      to cut down that number. All positions are stored as though White is to move. This cuts the
      number of stored positions in half. When possible, rotations along the diagonal are also
      combined. Again, though, as the complexity of the learning process increases, so too does
      the need for generalization.
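
Both capabilities can be illustrated with a small Python sketch. It is a simplification under assumptions of our own, not Samuel's data structure: a board is a hypothetical string of squares ('w'/'W' for White pieces, 'b'/'B' for Black, '.' for empty), the side swap is modeled as a 180-degree rotation plus a color exchange, and scores are assumed to be stored from the point of view of the side to move.

```python
# Translation table that exchanges the two colours (hypothetical encoding).
SWAP_COLOURS = str.maketrans("wWbB", "bBwW")


def canonical_key(squares, white_to_move):
    """Describe the position as though White is to move (generalization)."""
    if white_to_move:
        return squares
    # Rotate the board by reversing the squares, then exchange the colours.
    return squares[::-1].translate(SWAP_COLOURS)


def piece_count(key):
    """A coarse characteristic used to index stored positions."""
    return len(key) - key.count(".")


def remember(table, squares, white_to_move, score):
    """Store a score under (piece count, canonical position)."""
    key = canonical_key(squares, white_to_move)
    table.setdefault(piece_count(key), {})[key] = score


def recall(table, squares, white_to_move):
    """Look up a stored score, or return None if the position is unknown."""
    key = canonical_key(squares, white_to_move)
    return table.get(piece_count(key), {}).get(key)


table = {}
remember(table, "ww.b", False, 0.5)     # stored with Black to move
mirrored = recall(table, "w.bb", True)  # the same situation with White to move
unknown = recall(table, "ww.b", True)   # a different position
```

Indexing by piece count keeps each lookup confined to a small bucket, and the canonical key lets one stored entry serve both a position and its color-swapped mirror.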

At this point, we have begun to see one way in which learning is similar to other kinds of problem
solving. Its success depends on a good organizational structure for its knowledge base.

                                       Student Activity 7.1

Before reading the next section, answer the following questions.

1.    Discuss the role of AI in learning.

2.    What do you mean by skill refinement and knowledge acquisition?

3.    Would it be reasonable to apply rote learning procedure to chess? Why?

If your answers are correct, then proceed to the next section.


Learning by Taking Advice

A computer can do very little without a program for it to run. When a programmer writes a series
of instructions into a computer, a rudimentary kind of learning is taking place. The programmer is
a sort of teacher, and the computer is a sort of student. After being programmed, the computer is
now able to do something it previously could not. Executing the program may not be such a
simple matter, however. Suppose the program is written in a high-level language like LISP. Some
interpreter or compiler must intervene to change the teacher's instructions into code that the
machine can execute directly.

People process advice in an analogous way. In chess, the advice "fight for control of the center of
the board" is useless unless the player can translate the advice into concrete moves and plans. A
computer program might make use of the advice by adjusting its static evaluation function to
include a factor based on the number of center squares attacked by its own pieces.

Mostow describes a program called FOO, which accepts advice for playing hearts, a card game.
A human user first translates the advice from English into a representation that FOO can
understand. For example, "Avoid taking points" becomes:

      (avoid (take-points me) (trick))

FOO must operationalize this advice by turning it into an expression that contains concepts and
actions FOO can use when playing the game of hearts. One strategy FOO can follow is to
UNFOLD an expression by replacing some term by its definition. By UNFOLDing the definition of
avoid, FOO comes up with:

      (achieve (not (during (trick) (take-points me))))

FOO considers the advice to apply to the player called "me." Next, FOO UNFOLDs the definition
of trick:

      (achieve (not (during
                     (scenario
                      (each p1 (players) (play-card p1))
                      (take-trick (trick-winner)))
                     (take-points me))))

In other words, the player should avoid taking points during the scenario consisting of (1) players
playing cards and (2) one player taking the trick. FOO then uses case analysis to determine which
steps could cause one to take points. It rules out step 1 on the basis that it knows of no
intersection of the concepts take-points and play-card. But step 2 could affect taking points, so
FOO UNFOLDs the definition of take-points:

      (achieve (not (there-exists c1 (cards-played)
                     (there-exists c2 (point-cards)
                      (during (take (trick-winner) c1)
                              (take me c2))))))

This advice says that the player should avoid taking point-cards during the process of the
trick-winner taking the trick. The question for FOO now is: Under what conditions does (take me c2)
occur during (take (trick-winner) c1)? By using a technique called partial match, FOO hypothesizes
that points will be taken if me = trick-winner and c2 = c1. It transforms the advice into:

      (achieve (not (and (have-points (cards-played))
                         (= (trick-winner) me))))

This means, "Do not win a trick that has points." We have not travelled very far conceptually from
"avoid taking points," but it is important to note that the current vocabulary is one that FOO can
understand in terms of actually playing the game of hearts. Through a number of other
transformations, FOO eventually settles on:

      (achieve (=> (and (in-suit-led (card-of me))
                        (possible (trick-has-points)))
                   (low (card-of me))))

In other words, when playing a card that is the same suit as the card that was played first, if the
trick possibly contains points, then play a low card. At last, FOO has translated the rather vague
advice "avoid taking points" into a specific, usable heuristic. FOO is able to play a better game of
hearts after receiving this advice. A human can watch FOO play, detect new mistakes, and
correct them through yet more advice, such as "play high cards when it is safe to do so." The
ability to operationalize knowledge is critical for systems that learn from a teacher's advice. It is
also an important component of explanation-based learning.
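
The UNFOLD step can be sketched in Python by treating expressions as nested tuples. This is a minimal illustration, not FOO's implementation: the DEFINITIONS table and the parameter names event and situation are assumptions, with the definition of avoid inferred from the before-and-after expressions shown earlier in this section.

```python
# Hypothetical definition table: name -> (parameter names, body).
DEFINITIONS = {
    "avoid": (("event", "situation"),
              ("achieve", ("not", ("during", "situation", "event")))),
}


def substitute(expr, bindings):
    """Replace parameter names in a definition body by their arguments."""
    if isinstance(expr, tuple):
        return tuple(substitute(part, bindings) for part in expr)
    return bindings.get(expr, expr)


def unfold(expr, name, definitions):
    """Replace every use of `name` in expr by its definition."""
    if not isinstance(expr, tuple):
        return expr
    expr = tuple(unfold(part, name, definitions) for part in expr)
    if expr and expr[0] == name:
        params, body = definitions[name]
        return substitute(body, dict(zip(params, expr[1:])))
    return expr


def show(expr):
    """Render a nested tuple in the parenthesized notation used in the text."""
    if isinstance(expr, tuple):
        return "(" + " ".join(show(part) for part in expr) + ")"
    return expr


advice = ("avoid", ("take-points", "me"), ("trick",))
unfolded = show(unfold(advice, "avoid", DEFINITIONS))
# unfolded is "(achieve (not (during (trick) (take-points me))))"
```

The later steps (case analysis, partial match) need domain knowledge about hearts and are not covered by this sketch.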


Learning in Problem Solving

In the last section, we saw how a problem solver could improve its performance by taking advice
from a teacher. Can a program get better without the aid of a teacher? It can, by generalizing from
its own experiences.

Learning by Parameter Adjustment

Many programs rely on an evaluation procedure that combines information from several sources
into a single summary statistic. Game-playing programs do this in their static evaluation functions,
in which a variety of factors, such as piece advantage and mobility, are combined into a single
score reflecting the desirability of a particular board position. Pattern classification programs
often combine several features to determine the correct category into which a given stimulus
should be placed. In designing such programs, it is often difficult to know a priori how much weight
should be attached to each feature being used. One way of finding the correct weights is to begin
with some estimate of the correct settings and then to let the program modify the settings on the
basis of its experience. Features that appear to be good predictors of overall success will have
their weights increased, while those that do not will have their weights decreased, perhaps even
to the point of being dropped entirely.

Samuel's checkers program exploited this kind of learning in addition to the rote learning
described above, and it provides a good example of its use. As its static evaluation function, the
program used a polynomial of the form

     c1t1 + c2t2 + ... + c16t16

The t terms are the values of the sixteen features that contribute to the evaluation. The c terms are
the coefficients (weights) that are attached to each of these values. As learning progresses, the
c values will change.

The most important question in the design of a learning program based on parameter adjustment
is "When should the value of a coefficient be increased and when should it be decreased?" The
second question to be answered is then "By how much should the value be changed?" The simple
answer to the first question is that the coefficients of terms that predicted the final outcome
accurately should be increased, while the coefficients of poor predictors should be decreased. In
some domains, this is easy to do. If a pattern classification program uses its evaluation function to
classify an input and it gets the right answer, then all the terms that predicted that answer should
have their weights increased. But in game-playing programs, the problem is more difficult. The
program does not get any concrete feedback from individual moves. It does not find out for sure
until the end of the game whether it has won. But many moves have contributed to that final
outcome. Even if the program wins, it may have made some bad moves along the way. The
problem of appropriately assigning responsibility to each of the steps that led to a single outcome
is known as the credit assignment problem.
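
For the simpler pattern-classification case, where feedback is immediate, the adjustment rule can be sketched as follows. The details are assumptions made for illustration and are not taken from Samuel's program: each feature value is a signed vote (+1 or -1), weights are kept non-negative, the step size is fixed, and a weight that reaches zero is effectively dropped.

```python
def evaluate(weights, features):
    """Weighted sum c1*t1 + c2*t2 + ... of the feature values."""
    return sum(c * t for c, t in zip(weights, features))


def adjust(weights, features, outcome, step=0.25):
    """Reward terms that voted for the correct outcome, penalize the others.

    `outcome` is +1 or -1; `step` is a hypothetical fixed adjustment size.
    """
    adjusted = []
    for c, t in zip(weights, features):
        if t * outcome > 0:
            adjusted.append(c + step)            # good predictor
        elif t * outcome < 0:
            adjusted.append(max(0.0, c - step))  # poor predictor, may drop to 0
        else:
            adjusted.append(c)                   # feature was silent
    return adjusted


weights = [1.0, 1.0, 0.125]
features = [1, -1, -1]
before = evaluate(weights, features)       # wrong sign for outcome +1
weights = adjust(weights, features, +1)
after = evaluate(weights, features)        # now has the correct sign
```

In a game-playing program the same update cannot be applied move by move, because the true outcome is known only at the end of the game; that is exactly the credit assignment problem.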

Samuel's program exploits one technique, albeit imperfect, for solving this problem. Assume that
the initial values chosen for the coefficients are good enough that the total evaluation function
produces values that are fairly reasonable measures of the correct score even if they are not as
accurate as we hope to get them. Then this evaluation function can be used to provide feedback
to itself. Move sequences that lead to positions with higher values can be considered good (and
the terms in the evaluation function that suggested them can be reinforced).

Because of the limitations of this approach, however, Samuel's program did two other things, one
of which provided an additional test that progress was being made and the other of which
generated additional nudges to keep the process out of a rut.

When the program was in learning mode, it played against another copy of itself. Only one of the
copies altered its scoring function during the game; the other remained fixed. At the end of the
game, if the copy with the modified function won, then the modified function was accepted.
Otherwise, the old one was retained. If, however, this happened very many times, then some
drastic change was made to the function in an attempt to get the process going in a more
profitable direction.

Periodically, one term in the scoring function was eliminated and replaced by another. This was
possible because, although the program used only sixteen features at any one time, it actually
knew about thirty-eight. This replacement differed from the rest of the learning procedure since it
created a sudden change in the scoring function rather than a gradual shift in its weights.

This process of learning by successive modifications to the weights of terms in a scoring function
has many limitations, mostly arising out of its lack of exploitation of any knowledge about the
structure of the problem with which it is dealing and the logical relationships among the problem's
components. In addition, because the learning procedure is a variety of hill climbing, it suffers
from the same difficulties as other hill-climbing programs. Parameter adjustment is certainly
not a solution to the overall learning problem. But it is often a useful technique, either in situations
where very little additional knowledge is available or in programs in which it is combined with
more knowledge-intensive methods.

Learning with Macro-Operators

We saw that rote learning was used in the context of a checker-playing program. Similar
techniques can be used in more general problem-solving programs. The idea is the same: to
avoid expensive recomputation. For example, suppose you are faced with the problem of getting
to the downtown post office. Your solution may involve getting in your car, starting it, and driving
along a certain route. Substantial planning may go into choosing the appropriate route, but you
need not plan about how to go about starting your car. You are free to treat START-CAR as an
atomic action, even though it really consists of several actions: sitting down, adjusting the mirror,
inserting the key, and turning the key. Sequences of actions that can be treated as a whole are
called macro-operators.

Macro-operators were used in the early problem-solving system STRIPS. After each
problem-solving episode, the learning component takes the computed plan and stores it away as a
macro-operator, or MACROP. A MACROP is just like a regular operator except that it consists of a
sequence of actions, not just a single one. A MACROP's preconditions are the initial conditions of
the problem just solved and its postconditions correspond to the goal just achieved. In its simplest
form, the caching of previously computed plans is similar to rote learning.

Suppose we are given an initial blocks world situation in which ON(C, B) and ON(A, Table) are
both true. STRIPS can achieve the goal ON(A, B) by devising a plan with the four steps
UNSTACK(C, B), PUTDOWN(C), PICKUP(A), STACK(A, B). STRIPS now builds a MACROP
with preconditions ON(C, B), ON(A, Table) and postconditions ON(C, Table), ON(A, B). The body
of the MACROP consists of the four steps just mentioned. In future planning, STRIPS is free to
use this complex macro-operator just as it would use any other operator.

But rarely will STRIPS see the exact same problem twice. New problems will differ from previous
problems. We would still like the problem solver to make efficient use of the knowledge it gained
from its previous experiences. By generalizing MACROPs before storing them, STRIPS is able to
accomplish this. The simplest idea for generalization is to replace all of the constants in the
macro-operator by variables. Instead of storing the MACROP described in the previous
paragraph, STRIPS can generalize the plan to consist of the steps UNSTACK(X1, X2),
PUTDOWN(X1), PICKUP(X3), STACK(X3, X2), where X1, X2, and X3 are variables. This plan can
then be stored with preconditions ON(X1, X2), ON(X3, Table) and postconditions ON(X1, Table),
ON(X3, X2). Such a MACROP can now apply in a variety of situations.
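
The simplest generalization, replacing constants by variables, can be sketched in Python. The representation is an assumption made for illustration (literals as name/argument pairs, variables numbered in order of first appearance in the plan, and Table kept as a constant); it is not the data structure STRIPS used.

```python
def generalize(steps, preconditions, postconditions, keep=("Table",)):
    """Replace each constant by a variable X1, X2, ... consistently."""
    mapping = {}

    def variable_for(argument):
        if argument in keep:
            return argument  # some constants stay as they are
        if argument not in mapping:
            mapping[argument] = "X%d" % (len(mapping) + 1)
        return mapping[argument]

    def lift(literals):
        return [(name, tuple(variable_for(a) for a in args))
                for name, args in literals]

    # The plan is lifted first so that numbering follows the order of the steps.
    return lift(steps), lift(preconditions), lift(postconditions)


plan = [("UNSTACK", ("C", "B")), ("PUTDOWN", ("C",)),
        ("PICKUP", ("A",)), ("STACK", ("A", "B"))]
pre = [("ON", ("C", "B")), ("ON", ("A", "Table"))]
post = [("ON", ("C", "Table")), ("ON", ("A", "B"))]

macro_steps, macro_pre, macro_post = generalize(plan, pre, post)
```

Applied to the four-step plan above, this yields the variable names used in the text: C becomes X1, B becomes X2, and A becomes X3.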

Generalization is not so easy, however. Sometimes constants must retain their specific values. Suppose
our domain included an operator called STACK-ON-B(X), with preconditions that both X and B
be clear, and with postcondition ON(X, B).

STRIPS might come up with the plan UNSTACK(C, B), PUTDOWN(C), STACK-ON-B(A). Let's
generalize this plan and store it as a MACROP. The precondition becomes ON(X3, X2), the
postcondition becomes ON(X1, X2), and the plan itself becomes UNSTACK(X3, X2),
PUTDOWN(X3), STACK-ON-B(X1). Now, suppose we encounter a slightly different problem:

The generalized MACROP we just stored seems well-suited to solving this problem if we let X1 =
A, X2 = C, and X3 = E. Its preconditions are satisfied, so we construct the plan UNSTACK(E, C),
PUTDOWN(E), STACK-ON-B(A). But this plan does not work. The problem is that the
postcondition of the MACROP is overgeneralized. This operation is only useful for stacking blocks
onto B, which is not what we need in this new example. In this case, this difficulty will be
discovered when the last step is attempted. Although we cleared C, which is where we wanted to
put A, we failed to clear B, which is where the MACROP is going to try to put it. Since B is not
clear, STACK-ON-B cannot be executed. If B had happened to be clear, the MACROP would
have executed to completion, but it would not have accomplished the stated goal.

In reality, STRIPS uses a more complex generalization procedure. First, all constants are
replaced by variables. Then, for each operator in the parameterized plan, STRIPS reevaluates its
preconditions. In our example, the preconditions of steps 1 and 2 are satisfied, but the only way to
ensure that B is clear for step 3 is to assume that block X2, which was cleared by the UNSTACK
operator, is actually block B. Through "re-proving" that the generalized plan works, STRIPS
locates constraints of this kind.

It turns out that the set of problems for which macro-operators are critical is exactly those
problems with nonserializable subgoals. Nonserializability means that working on one subgoal will
necessarily interfere with the previous solution to another subgoal. Recall that we discussed such
problems in connection with nonlinear planning. Macro-operators can be useful in such cases,
since one macro-operator can produce a small global change in the world, even though the
individual operators that make it up produce many undesirable local changes.

For example, consider the 8-puzzle. Once a program has correctly placed the first four tiles, it is
difficult to place the fifth tile without disturbing the first four. Because disturbing previously solved
subgoals is detected as a bad thing by heuristic scoring functions, it is strongly resisted. For
many problems, including the 8-puzzle and Rubik's cube, weak methods based on heuristic
scoring are therefore insufficient. Hence, we either need domain-specific knowledge, or else a
new weak method. Fortunately, we can learn the domain-specific knowledge we need in the form
of macro-operators. Thus, macro-operators can be viewed as a weak method for learning. In the
8-puzzle, for example, we might have a macro (a complex, prestored sequence of operators) for
placing the fifth tile without disturbing any of the first four tiles externally (although in fact they are
disturbed within the macro itself). This approach contrasts with STRIPS, which learned its
MACROPs gradually, from experience. Korf's algorithm runs in time proportional to the time it
takes to solve a single problem without macro-operators.

Learning by Chunking

Chunking is a process similar in flavor to macro-operators. The idea of chunking comes from the
psychological literature on memory and problem solving. SOAR also exploits chunking so that its
performance can increase with experience. In fact, the designers of SOAR hypothesize that
chunking is a universal learning method, i.e., it can account for all types of learning in intelligent
systems.

SOAR solves problems by firing productions, which are stored in long-term memory. Some of
those firings turn out to be more useful than others. When SOAR detects a useful sequence of
production firings, it creates a chunk, which is essentially a large production that does the work of
an entire sequence of smaller ones. As in MACROPs, chunks are generalized before they are
stored.

Problems like choosing which subgoals to tackle and which operators to try (i.e., search control
problems) are solved with the same mechanisms as problems in the original problem space.
Because the problem solving is uniform, chunking can be used to learn general search control
knowledge in addition to operator sequences. For example, if SOAR tries several different
operators, but only one leads to a useful path in the search space, then SOAR builds productions
that help it choose operators more wisely in the future. SOAR has used chunking to replicate the
macro-operator results described in the last section. In solving the 8-puzzle, for example, SOAR
learns how to place a given tile without permanently disturbing the previously placed tiles. Given
the way that SOAR learns, several chunks may encode a single macro-operator, and one chunk
may participate in a number of macro sequences. Chunks are generally applicable toward any
goal state. This contrasts with macro tables, which are structured towards reaching a particular
goal state from any initial state. Also, chunking emphasizes how learning can occur during
problem solving, while macro tables are usually built during a preprocessing stage. As a result,
SOAR is able to learn within trials as well as across trials. Chunks learned during the initial stages
of solving a problem are applicable in the later stages of the same problem-solving episode. After
a solution is found, the chunks remain in memory, ready for use in the next problem.

The price that SOAR pays for this generality and flexibility is speed. At present, chunking is
inadequate for duplicating the contents of large, directly computed macro-operator tables.

The Utility Problem

PRODIGY employs several learning mechanisms. One mechanism uses explanation-based learning
(EBL), a learning method we discuss in the section on Explanation-Based Learning. PRODIGY can
examine a trace of its own problem-solving behavior and try to explain why certain paths failed.
The program uses those explanations to formulate control rules that help the problem solver avoid
those paths in the future. So while SOAR learns primarily from examples of successful problem
solving, PRODIGY also learns from its failures.

A major contribution of the work on EBL in PRODIGY was the identification of the utility problem in
learning systems. While new search control knowledge can be of great benefit in solving future
problems efficiently, there are also some drawbacks. The learned control rules can take up large
amounts of memory and the search program must take the time to consider each rule at each
step during problem solving. Considering a control rule amounts to seeing if its postconditions are
desirable and seeing if its preconditions are satisfied. This is a time-consuming process. So while
learned rules may reduce problem-solving time by directing the search more carefully, they may
also increase problem-solving time by forcing the problem solver to consider them. If we only
want to minimize the number of node expansions in the search space, then the more control rules
we learn, the better. But if we want to minimize the total CPU time required to solve a problem,
we must consider this trade-off.

PRODIGY maintains a utility measure for each control rule. This measure takes into account the
average savings provided by the rule, the frequency of its application, and the cost of matching it.
If a proposed rule has a negative utility, it is discarded (or "forgotten"). If not, it is placed in long-
term memory with the other rules. It is then monitored during subsequent problem solving. If its
utility falls, the rule is discarded. Empirical experiments have demonstrated the effectiveness of
keeping only those control rules with high utility. Utility considerations apply to a wide range of AI
learning systems. For example, the same considerations arise when dealing with large, expensive
chunks in SOAR.
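
The bookkeeping behind such a utility measure can be sketched as follows. The text names only the three ingredients; combining them as savings times frequency minus match cost is an assumption made here for illustration, as are the class and field names and the sample numbers.

```python
from dataclasses import dataclass


@dataclass
class ControlRule:
    """A learned control rule with the statistics needed to judge its worth."""
    name: str
    average_savings: float        # time saved when the rule applies
    application_frequency: float  # fraction of matches in which it applies
    match_cost: float             # time spent testing the rule

    def utility(self):
        # Assumed combination: expected savings minus the cost of matching.
        return self.average_savings * self.application_frequency - self.match_cost


def prune(rules):
    """Keep only rules whose utility is not negative; the rest are forgotten."""
    return [rule for rule in rules if rule.utility() >= 0]


rules = [
    ControlRule("useful", average_savings=8.0,
                application_frequency=0.5, match_cost=1.0),
    ControlRule("costly", average_savings=2.0,
                application_frequency=0.25, match_cost=1.0),
]
kept = [rule.name for rule in prune(rules)]
```

Re-running prune as the statistics are updated during later problem solving corresponds to the monitoring step described above.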


Learning by Induction

Classification is the process of assigning, to a particular input, the name of a class to which it
belongs. The classes from which the classification procedure can choose can be described in a
variety of ways. Their definition will depend on the use to which they will be put.

Classification is an important component of many problem-solving tasks. In its simplest form, it is
presented as a straightforward recognition task. An example of this is the question "What letter of
the alphabet is this?" But often classification is embedded inside another operation. To see how
this can happen, consider a problem-solving system that contains the following production rule:

      If:   the current goal is to get from place A to place B, and
            there is a WALL separating the two places
      then: look for a DOORWAY in the WALL and go through it.

To use this rule successfully, the system's matching routine must be able to identify an object as
a wall. Without this, the rule can never be invoked. Then, to apply the rule, the system must be
able to recognize a doorway.

Before classification can be done, the classes it will use must be defined. This can be done in a
variety of ways, including:

Isolate a set of features that are relevant to the task domain. Define each class by a weighted
sum of values of these features. Each class is then defined by a scoring function that looks very
similar to the scoring functions often used in other situations, such as game playing. Such a
function has the form:

      c1t1 + c2t2 + c3t3 + ...

Each t corresponds to a value of a relevant parameter, and each c represents the weight to be
attached to the corresponding t. Negative weights can be used to indicate features whose
presence usually constitutes negative evidence for a given class.

For example, if the task is weather prediction, the parameters can be such measurements as
rainfall and location of cold fronts. Different functions can be written to combine these parameters
to predict sunny, cloudy, rainy, or snowy weather.
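
A minimal Python sketch of this first scheme follows. The feature names and weights are invented for illustration; in practice the weights would be set by hand or learned by coefficient adjustment.

```python
# Hypothetical weights c for each class; negative weights mark features
# whose presence counts as evidence against the class.
WEIGHTS = {
    "rainy": {"rainfall_mm": 1.0, "cold_front_near": 2.0, "sunshine_hours": -0.5},
    "sunny": {"rainfall_mm": -1.0, "cold_front_near": -1.0, "sunshine_hours": 0.5},
}


def score(weights, features):
    """Weighted sum c1*t1 + c2*t2 + ... over the features of one class."""
    return sum(c * features.get(name, 0.0) for name, c in weights.items())


def classify(features):
    """Pick the class whose scoring function gives the highest value."""
    return max(WEIGHTS, key=lambda cls: score(WEIGHTS[cls], features))


wet_day = {"rainfall_mm": 4.0, "cold_front_near": 1.0, "sunshine_hours": 1.0}
clear_day = {"rainfall_mm": 0.0, "cold_front_near": 0.0, "sunshine_hours": 9.0}
```

Each class has its own scoring function, and classification simply picks the class with the highest score.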

Isolate a set of features that are relevant to the task domain. Define each class as a structure
composed of those features. For example, if the task is to identify animals, the body of each type
of animal can be stored as a structure, with various features representing such things as color,
length of neck, and feathers.

There are advantages and disadvantages to each of these general approaches. The statistical
approach taken by the first scheme presented here is often more efficient than the structural
approach taken by the second. But the second is more flexible and more extensible.

Regardless of the way that classes are to be described, it is often difficult to construct, by hand,
good class definitions. This is particularly true in domains that are not well understood or that
change rapidly. Thus the idea of producing a classification program that can evolve its own class
definitions is appealing. This task of constructing class definitions is called concept learning, or
induction. The techniques used for this task must, of course, depend on the way that classes
(concepts) are described. If classes are described by scoring functions, then concept learning can
be done using the technique of coefficient adjustment. If, however, we want to define classes
structurally, some other technique for learning class definitions is necessary. In this section, we
present three such techniques.

Winston's Learning Program

Winston describes an early structural concept-learning program. This program operated in a
simple blocks world domain. Its goal was to construct representations of the definitions of
concepts in the blocks domain. For example, it learned the concepts: House, Tent, and Arch shown
in Figure 7.2. The figure also shows an example of a near miss for each concept. A near miss is an
object that is not an instance of the concept in question but that is very similar to such instances.

The program started with a line drawing of a blocks world structure. It used procedures to analyze
the drawing and construct a semantic net representation of the structural description of the
object(s). This structural description was then provided as input to the learning program. An
example of such a structural description for the House of Figure 7.2 is shown in Figure 7.3(a). Node
A represents the entire structure, which is composed of two parts: node B, a Wedge, and node C, a
Brick. Figures 7.3(b) and 7.3(c) show descriptions of the two Arch structures of Figure 7.2. These
descriptions are identical except for the type of the object on top: one is a Brick while the
other is a Wedge. Notice that the two supporting objects are related not only by left-of and right-of
links, but also by a does-not-marry link, which says that the two objects do not marry. Two objects
marry if they have faces that touch, and they have a common edge. The marry relation is critical in
the definition of an Arch. It is the difference between the first arch structure and the near miss
arch structure shown in Figure 7.2.

                               Figure 7.2: Some Blocks World Concepts


The basic approach that Winston's program took to the problem of concept formation can be
described as follows:

1. Begin with a structural description of one known instance of the concept. Call that description
     the concept definition.

2. Examine descriptions of other known instances of the concept. Generalize the definition to
     include them.

3. Examine descriptions of near misses of the concept. Restrict the definition to exclude these.

     Steps 2 and 3 of this procedure can be interleaved.

Steps 2 and 3 of this procedure rely heavily on a comparison process by which similarities and
differences between structures can be detected. This process must function in much the same
way as does any other matching process, such as one to determine whether a given production
rule can be applied to a particular problem state. Because differences as well as similarities must
be found, the procedure must perform not just literal but also approximate matching. The output
of the comparison procedure is a skeleton structure describing the commonalities between the
two input structures. It is annotated with a set of comparison notes that describe specific
similarities and differences between the inputs.

To see how this approach works, we trace it through the process of learning what an arch is.
Suppose that the arch description of Figure 7.3(b) is presented first. It then becomes the definition
of the concept Arch. Then suppose that the arch description of Figure 7.3(c) is presented. The
comparison routine will return a structure similar to the two input structures except that it will note
that the objects represented by the nodes labeled C are not identical. This structure is shown as
Figure 7.4. The c-note link from node C describes the difference found by the comparison routine.
It notes that the difference occurred in the isa link, and that in the first structure the isa link pointed
to Brick, and in the second it pointed to Wedge. It also notes that if we were to follow isa links from
Brick and Wedge, these links would eventually merge. At this point, a new description of the
concept Arch can be generated. This description could say simply that node C must be either a
Brick or a Wedge. But since this particular disjunction has no previously known significance, it is
probably better to trace up the isa hierarchies of Brick and Wedge until they merge. Assuming that
this happens at the node Object, the Arch definition shown in Figure 7.5 can be built.
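
Tracing up two isa chains until they merge can be sketched as follows. The ISA table is a hypothetical fragment of the hierarchy, with Brick and Wedge placed directly under Object as assumed in the text; a real hierarchy could have intermediate nodes.

```python
# Hypothetical isa hierarchy: each node maps to its parent (None at the top).
ISA = {"Brick": "Object", "Wedge": "Object", "Object": None}


def ancestors(node):
    """Return the node followed by everything reachable along isa links."""
    chain = []
    while node is not None:
        chain.append(node)
        node = ISA.get(node)
    return chain


def merge_point(a, b):
    """First node on b's isa chain that also lies on a's chain, if any."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    return None


generalized_top = merge_point("Brick", "Wedge")
```

The value found for node C is then used in place of the original isa Brick link in the generalized definition.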
                                  Figure 7.3: The Structural Description

                                                Figure 7.4

                          Figure 7.5: The Arch Description after Two Examples


Next, suppose that the near miss arch is presented. This time, the comparison routine will note
that the only difference between the current definition and the near miss is in the does-not-marry link
between nodes B and D. But since this is a near miss, we do not want to broaden the definition to
include it. Instead, we want to restrict the definition so that it is specifically excluded. To do this,
we modify the link does-not-marry, which may simply be recording something that has happened by
chance to be true of the small number of examples that have been presented. It must now say
must-not-marry. Actually, must-not-marry should not be a completely new link. There must be some
structure among link types to reflect the relationships between marry, does-not-marry, and must-not-
marry.

Notice how the problem-solving and knowledge representation techniques we covered in earlier
chapters are brought to bear on the problem of learning. Semantic networks were used to
describe block structures, and an isa hierarchy was used to describe relationships among already
known objects. A matching process was used to detect similarities and differences between
structures, and hill climbing allowed the program to evolve a more and more accurate concept
definition.

This approach to structural concept learning is not without its problems. One major problem is
that a teacher must guide the learning program through a carefully chosen sequence of
examples. In the next section, we explore a learning technique that is insensitive to the order in
which examples are presented.

Example




        Near Misses:

Goal

Winston's Arch Learning Program models how we humans learn a concept such as "What is an
arch?"

This is done by presenting to the "student" (the program) examples and counterexamples that
are nevertheless very similar to examples. The latter are called "near misses."

The program builds (Winston's version of) a semantic network. With every example and with
every near miss, the semantic network gets better at predicting what exactly an arch is.




Learning Procedure W

1.    Let the description of the first sample, which must be an example, be the initial description.

2.    For all other samples:

      -    If the sample is a near miss, call SPECIALIZE.

      -    If the sample is an example, call GENERALIZE.

Specialize

•     Establish a matching between parts of the near miss and parts of the model.

•     IF there is a single most important difference THEN:

      –    If the model has a link that is not in the near miss, convert it into a MUST link.

      –    If the near miss has a link that is not in the model, create a MUST-NOT link.

•     ELSE ignore the near miss.

Generalize

1.    Establish a matching between parts of the sample and the model.

2.    For each difference:

    –      If the link in the model points to a superclass or subclass of the class that the link
           points to in the sample, establish a MUST-BE link at the join point.

    –      If the links point to classes of an exhaustive set, drop the link from the model.

    –      Otherwise create a new common class and a MUST-BE arc to it.

    –      If a link is missing in the model or in the sample, drop it from the other.

    –      If numbers are involved, create an interval that contains both numbers.

    –      Otherwise, ignore the difference.




Because the only difference is the missing support, and the sample is a near miss, that arc must be
important.

New Model

This uses SPECIALIZE. (Actually it uses it twice; this is an exception.)

However, the following principles may be applied.

1.    The teacher must use near misses and examples judiciously.

2.    You cannot learn if you cannot know (=> KR)

3.    You cannot know if you cannot distinguish relevant from irrelevant features. (=> Teachers
      can help here, by using near misses.)

4.    No-Guessing Principle: When there is doubt what to learn, learn nothing.

5.    Learning Conventions called "Felicity Conditions" help. The teacher knows that there should
      be only one relevant difference, and the Student uses it.

6.    No-Altering Principle: If an example fails the model, create an exception. (Penguins are
      birds).

7.    Martin's Law: You can't learn anything unless you almost know it already.

Learning Rules from Examples

Given are 8 instances:

1.   DECCG

2.   AGDBC

3.   GCFDC

4.   CDFDE

5.   CEGEC
(Comment: "These are good")

1.   CGCGF

2.   DGFCD

3.   ECGCD
(Comment: "These are bad")

The following rule perfectly describes the warnings (the bad strings), where M(i, X) means that
position i of the string holds the letter X:

(M(1,C) ∧ M(2,G) ∧ M(3,C) ∧ M(4,G) ∧ M(5,F)) ∨

(M(1,D) ∧ M(2,G) ∧ M(3,F) ∧ M(4,C) ∧ M(5,D)) ∨

(M(1,E) ∧ M(2,C) ∧ M(3,G) ∧ M(4,C) ∧ M(5,D) ) ->Warning

Now we eliminate the first conjunct from all 3 disjuncts:

(M(2,G) ∧ M(3,C) ∧ M(4,G) ∧ M(5,F) ) ∨

(M(2,G) ∧ M(3,F) ∧ M(4,C) ∧ M(5,D)) ∨

(M(2,C) ∧ M(3,G) ∧ M(4,C) ∧ M(5,D) ) -> Warning

Now we check whether any of the good strings 1 - 5 would cause a warning.

Strings 1, 4, and 5 are ruled out at the first remaining position (which is now position 2). Strings
2 and 3 are ruled out at the next (third) position.

Thus, we do not really need the first conjunct in the Warning rule!

We keep dropping conjuncts in this way until we are left with

M(5,F) ∨ M(5,D) -> Warning

This rule contains all the knowledge contained in the examples!

Note that this elimination approach is order dependent! Working from right to left you would get:

(M(1,C) ∧ M(2,G)) ∨ (M(1,D) ∧ M(2,G)) ∨ M(1,E) -> Warning

There are N! (N factorial) orders in which we could try to delete features. Also, there is no
guarantee that the rules will work correctly on new instances. In fact, rules are not even
guaranteed to exist!
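
The dropping procedure, and its dependence on the order of deletion, can be reproduced with a short Python sketch. The representation (one dictionary per disjunct, mapping a position to the letter required there) and the function names are choices made here for illustration. A conjunct is dropped from a disjunct only if no good string would then cause a warning.

```python
GOOD = ["DECCG", "AGDBC", "GCFDC", "CDFDE", "CEGEC"]
BAD = ["CGCGF", "DGFCD", "ECGCD"]


def covers(disjunct, s):
    """True if string s satisfies every M(position, letter) in the disjunct."""
    return all(s[pos - 1] == letter for pos, letter in disjunct.items())


def warns(rule, s):
    """A rule is a list of disjuncts; it fires if any disjunct covers s."""
    return any(covers(d, s) for d in rule)


def learn_rule(order):
    """Start from one disjunct per bad string, then try to drop conjuncts
    position by position, in the given order."""
    rule = [dict(enumerate(s, start=1)) for s in BAD]
    for pos in order:
        for d in rule:
            if pos not in d:
                continue
            letter = d.pop(pos)
            if any(warns(rule, g) for g in GOOD):
                d[pos] = letter  # the drop was too general, so undo it
    unique = []
    for d in rule:  # merge disjuncts that have become identical
        if d not in unique:
            unique.append(d)
    return unique


left_to_right = learn_rule([1, 2, 3, 4, 5])
right_to_left = learn_rule([5, 4, 3, 2, 1])
```

With the eight instances above, left_to_right keeps only the tests on position 5, while right_to_left keeps tests on positions 1 and 2, matching the two rules given in the text.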

To deal with the N! problem, try to find "higher-level features". For instance, if two features always
occur together, replace them by one feature.

                                     Student Activity 7.2

Before reading next section, answer the following questions.

1.    What do you mean by learning by induction? Explain with an example.

2.    Describe learning by macro-operators.

3.    Explain why a good knowledge representation is useful in reasoning with knowledge.

If your answers are correct, then proceed to the next section.


Explanation-Based Learning

The previous section illustrated how we can induce concept descriptions from positive and
negative examples. Learning complex concepts using these procedures typically requires a
substantial number of training instances. But people seem to be able to learn quite a bit from
single examples. Consider a chess player who, as Black, has reached the position shown in
Figure 7.6. The position is called a "fork" because the white knight attacks both the black king and
the black queen. Black must move the king, thereby leaving the queen open to capture. From this
single experience, Black is able to learn quite a bit about the fork trap: the idea is that if any piece
x attacks both the opponent's king and another piece y, then piece y will be lost. We don't need to
see dozens of positive and negative examples of fork positions in order to draw these
conclusions. From just one experience, we can learn to avoid this trap in the future and perhaps
to use it to our own advantage.

What makes such single-example learning possible? The answer, not surprisingly, is knowledge.
The chess player has plenty of domain-specific knowledge that can be brought to bear, including
the rules of chess and any previously acquired strategies. That knowledge can be used to identify
the critical aspects of the training example. In the case of the fork, we know that the double
simultaneous attack is important while the precise position and type of the attacking piece is not.

Much of the recent work in machine learning has moved away from the empirical, data-intensive
approach described in the last section toward this more analytical, knowledge-intensive approach.
A number of independent studies led to the characterization of this approach as explanation-based
learning. An EBL system attempts to learn from a single example x by explaining why x is an
example of the target concept. The explanation is then generalized, and the system's
performance is improved through the availability of this knowledge.




                                 Figure 7.6: A Fork Position in Chess


Mitchell et al. and DeJong and Mooney both describe general frameworks for EBL programs and
give general learning algorithms. We can think of EBL programs as accepting the following as
input:

1.   A Training Example: what the learning program "sees" in the world.

2.   A Goal Concept: a high-level description of what the program is supposed to learn.

3.   An Operationality Criterion: a description of which concepts are usable.

4.   A Domain Theory: a set of rules that describe relationships between objects and actions in a
     domain.

From this, EBL computes a generalization of the training example that is sufficient to describe the
goal concept, and also satisfies the operationality criterion. Let's look more closely at this
specification. The training example is a familiar input: it is the same thing as the example in the
version space algorithm. The goal concept is also familiar, but in previous sections, we have
viewed the goal concept as an output of the program, not an input. The assumption here is that
the goal concept is not operational, just like the high-level card-playing advice. An EBL program
seeks to operationalize the goal concept by expressing it in terms that a problem-solving program
can understand. These terms are given by the operationality criterion. In the chess example, the
goal concept might be something like "bad position for Black," and the operationalized concept
would be a generalized description of situations similar to the training example, given in terms of
pieces and their relative positions. The last input to an EBL program is a domain theory, in our
case, the rules of chess. Without such knowledge, it is impossible to come up with a correct
generalization of the training example.

Explanation-based generalization (EBG) is an algorithm for EBL described in Mitchell et al. It has two
steps: (1) explain and (2) generalize. During the first step, the domain theory is used to prune
away all the unimportant aspects of the training example with respect to the goal concept. What is
left is an explanation of why the training example is an instance of the goal concept. This
explanation is expressed in terms that satisfy the operationality criterion. The next step is to
generalize the explanation as far as possible while still describing the goal concept. Following our
chess example, the first EBL step chooses to ignore White's pawns, king, and rook, and
constructs an explanation consisting of White's knight, Black's king, and Black's queen, each in
their specific positions. Operationality is ensured: all chess-playing programs understand the
basic concepts of piece and position. Next, the explanation is generalized. Using domain
knowledge, we find that moving the pieces to a different part of the board is still bad for Black. We
can also determine that other pieces besides knights and queens can participate in fork attacks.

In reality, current EBL methods run into difficulties in domains as complex as chess, so we will not
pursue this example further. Instead, let's look at a simpler case. Consider the problem of
learning the concept Cup. Unlike the arch-learning program we want to be able to generalize from
a single example of a cup. Suppose the example is:

Training Example:

owner(Object23, Ralph) ∧ has-part(Object23, Concavity12) ∧

is(Object23, Light) ∧ color(Object23, Brown) ∧ ...

Clearly, some of the features of Object23 are more relevant to its being a cup than others. So far in
this unit we have seen several methods for isolating relevant features. All these methods require
many positive and negative examples. In EBL we instead rely on domain knowledge, such as:

Domain Knowledge:

is(x, Light) ∧ has-part(x, y) ∧ isa(y, Handle) → liftable(x)

has-part(x, y) ∧ isa(y, Bottom) ∧ is(y, Flat) → stable(x)

has-part(x, y) ∧ isa(y, Concavity) ∧ is(y, Upward-Pointing) → open-vessel(x)

We also need a goal concept to operationalize:

Goal Concept: Cup

x is a Cup if x is liftable, stable, and open-vessel.

Operationality Criterion: Concept definition must be expressed in purely structural terms (e.g.,
Light, Flat, etc.).

Given a training example and a functional description, we want to build a general structural
description of a cup. The first step is to explain why Object23 is a cup. We do this by constructing a
proof, as shown in Figure 7.7. Standard theorem-proving techniques can be used to find such a
proof. Notice that the proof isolates the relevant features of the training example; nowhere in the
proof do the predicates owner and color appear. The proof also serves as a basis for a valid
generalization. If we gather up all the assumptions and replace constants with variables, we get
the following description of a cup:

has-part(x, y) ∧ isa(y, Concavity) ∧ is(y, Upward-Pointing) ∧

has-part(x, z) ∧ isa(z, Bottom) ∧ is(z, Flat) ∧

has-part(x, w) ∧ isa(w, Handle) ∧ is(x, Light)

This definition satisfies the operationality criterion and could be used by a robot to classify
objects.
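
The explain-then-generalize steps for the cup can be sketched in Python as a small backward chainer that records the facts used in the proof and then replaces individual objects by variables. Several things here are assumptions: the training example in the text is elided after its fourth conjunct, so the extra facts about Handle16, Bottom19, and Concavity12 are hypothetical stand-ins; individuals are recognized by the convention that their names end in a digit; and the generalization step is the simple constant-to-variable replacement, not goal regression.

```python
import itertools

# The first four facts come from the text. The rest are HYPOTHETICAL
# stand-ins for the part of the training example elided by "...".
FACTS = [
    ("owner", "Object23", "Ralph"),
    ("has-part", "Object23", "Concavity12"),
    ("is", "Object23", "Light"),
    ("color", "Object23", "Brown"),
    ("has-part", "Object23", "Handle16"),
    ("isa", "Handle16", "Handle"),
    ("has-part", "Object23", "Bottom19"),
    ("isa", "Bottom19", "Bottom"),
    ("is", "Bottom19", "Flat"),
    ("isa", "Concavity12", "Concavity"),
    ("is", "Concavity12", "Upward-Pointing"),
]

# Domain theory and goal concept as (premises, conclusion); "?" marks variables.
RULES = [
    ([("is", "?x", "Light"), ("has-part", "?x", "?y"), ("isa", "?y", "Handle")],
     ("liftable", "?x")),
    ([("has-part", "?x", "?y"), ("isa", "?y", "Bottom"), ("is", "?y", "Flat")],
     ("stable", "?x")),
    ([("has-part", "?x", "?y"), ("isa", "?y", "Concavity"),
      ("is", "?y", "Upward-Pointing")], ("open-vessel", "?x")),
    ([("liftable", "?x"), ("stable", "?x"), ("open-vessel", "?x")], ("cup", "?x")),
]

_fresh = itertools.count()


def is_var(term):
    return term.startswith("?")


def walk(term, env):
    """Follow variable bindings until an unbound variable or a constant."""
    while term in env:
        term = env[term]
    return term


def unify(a, b, env):
    """Return env extended so that atoms a and b match, or None."""
    if a[0] != b[0] or len(a) != len(b):
        return None
    env = dict(env)
    for s, t in zip(a[1:], b[1:]):
        s, t = walk(s, env), walk(t, env)
        if s == t:
            continue
        if is_var(s):
            env[s] = t
        elif is_var(t):
            env[t] = s
        else:
            return None
    return env


def rename(atom, n):
    """Give a rule's variables fresh names for each use of the rule."""
    return tuple(f"{t}_{n}" if is_var(t) else t for t in atom)


def prove(goal, env):
    """Yield (env, facts used) for every proof of goal."""
    for fact in FACTS:
        e = unify(goal, fact, env)
        if e is not None:
            yield e, [fact]
    for premises, head in RULES:
        n = next(_fresh)
        e = unify(goal, rename(head, n), env)
        if e is not None:
            yield from prove_all([rename(p, n) for p in premises], e)


def prove_all(goals, env):
    if not goals:
        yield env, []
        return
    for e, leaves in prove(goals[0], env):
        for e2, rest in prove_all(goals[1:], e):
            yield e2, leaves + rest


def generalize(leaves):
    """Replace individuals (names ending in a digit) by variables."""
    names = {}

    def var_for(term):
        if term[-1].isdigit():
            return names.setdefault(term, f"?v{len(names)}")
        return term

    return [(atom[0],) + tuple(var_for(t) for t in atom[1:]) for atom in leaves]


_, leaves = next(prove(("cup", "Object23"), {}))
cup_definition = generalize(leaves)
```

The list leaves holds only the facts that the proof actually needed, so owner and color never appear in it, and cup_definition corresponds to the nine-conjunct structural description given above.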

                                       Figure 7.7: An Explanation




Simply replacing constants by variables worked in this example, but in some cases it is necessary
to retain certain constants. To catch these cases, we must reprove the goal. This process, which
we saw earlier in our discussion of learning in STRIPS, is called goal regression.

As we have seen, EBL depends strongly on a domain theory. Given such a theory, why are
examples needed at all? We could have operationalized the goal concept Cup without reference
to an example, since the domain theory contains all of the requisite information. The answer is
that examples help to focus the learning on relevant operationalizations. Without an example cup,
EBL is faced with the task of characterizing the entire range of objects that satisfy the goal
concept. Most of these objects will never be encountered in the real world, and so the result will
be overly general.

Providing a tractable domain theory is a difficult task. There is evidence that humans do not learn
with very primitive relations. Instead, they create incomplete and inconsistent domain theories.
For example, returning to chess, such a theory might include concepts like "weak pawn
structure." Getting EBL to work in ill-structured domain theories is an active area of research.

EBL shares many features of all the learning methods described in earlier sections. Like concept
learning, EBL begins with a positive example of some concept. As in learning by advice taking,
the goal is to operationalize some piece of knowledge. And EBL techniques, like the techniques
of chunking and macro-operators, are often used to improve the performance of problem-solving
engines. The major difference between EBL and other learning methods is that EBL programs are
built to take advantage of domain knowledge. Since learning is just another kind of problem
solving, it should come as no surprise that there is leverage to be found in knowledge.

Learning Automation

The theory of learning automata was first introduced in 1961 (Tsetlin, 1961). Since that time these
systems have been studied intensely, both analytically and through simulations (Lakshmivarahan,
1981). Learning automata systems are finite set adaptive systems which interact iteratively with a
general environment. Through a probabilistic trial-and-error response process they learn to
choose or adapt to a behaviour that produces the best response. They are, essentially, a form of
weak, inductive learners.

In Figure 7.8, we see that the learning model for learning automata has been simplified to just
two components, an automaton (learner) and an environment. The learning cycle begins with an
input to the learning automata system from the environment. This input elicits one of a finite
number of possible responses from the automaton, and the environment then provides some form
of feedback to the automaton in return. The automaton uses this feedback to alter its
stimulus-response mapping structure so as to improve its behaviour.

As a simple example, suppose a learning automaton is being used to learn the best temperature
control setting for your office each morning. It may select any one of ten temperature range
settings at the beginning of each day (Figure 7.9). Without any prior knowledge of your
temperature preferences, the automaton randomly selects a first setting using the probability
vector corresponding to the temperature settings.




                               Figure 7.8: Learning Automaton Model

                                 Figure 7.9: Temperature Control Model


Since the probability values are uniformly distributed, any one of the settings will be selected with
equal likelihood. After the selected temperature has stabilized, the environment may respond with
a simple good-bad feedback response. If the response is good, the automaton will modify its
probability vector by rewarding the probability corresponding to the good setting with a positive
increment and reducing all other probabilities proportionately to maintain the sum equal to 1. If the
response is bad, the automaton will penalize the selected setting by reducing the probability
corresponding to the bad setting and increasing all other values proportionately. This process is
repeated each day until the good selections have high probability values and all bad choices have
values near zero. Thereafter, the system will almost always choose the good settings. If, at some
point in the future, your temperature preferences change, the automaton can easily readapt.
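
The reward and penalty updates for the temperature example can be sketched in Python as follows. The step sizes a and b, the assumption that exactly one setting (preferred) draws a good response, and the function names are all choices made for this illustration; the text does not fix a particular update formula.

```python
import random


def reward(p, i, a=0.1):
    """Raise p[i]; shrink the others proportionately so the sum stays 1."""
    return [q + a * (1.0 - q) if j == i else q * (1.0 - a)
            for j, q in enumerate(p)]


def penalize(p, i, b=0.1):
    """Lower p[i]; grow the others proportionately so the sum stays 1."""
    rest = 1.0 - p[i]
    if rest == 0.0:
        return list(p)  # no other probability mass to scale up
    new_i = p[i] * (1.0 - b)
    scale = (1.0 - new_i) / rest
    return [new_i if j == i else q * scale for j, q in enumerate(p)]


def simulate(preferred, days=500, settings=10, seed=0):
    """One choice per day; feedback is good only for the preferred setting."""
    rng = random.Random(seed)
    p = [1.0 / settings] * settings  # uniform: no prior knowledge
    for _ in range(days):
        choice = rng.choices(range(settings), weights=p)[0]
        if choice == preferred:
            p = reward(p, choice)
        else:
            p = penalize(p, choice)
    return p


probabilities = simulate(preferred=6)
```

Because the preferred setting is never penalized, its probability can only grow from day to day, while every other setting loses probability each time it is tried.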

Learning automata have been generalized and studied in various ways. One such generalization
has been given the special name of collective learning automata (CLA). CLAs are standard
learning automata systems except that feedback is not provided to the automaton after each
response. In this case, several collective stimulus-response actions occur before feedback is
passed to the automaton. It has been argued (Bock, 1976) that this type of learning more closely
resembles that of human beings in that we usually perform a number or group of primitive actions
before receiving feedback on the performance of such actions, such as solving a complete
problem on a test or parking a car. We illustrate the operation of CLAs with an example of
learning to play the game of Nim in an optimal way.

Nim is a two-person zero-sum game in which the players alternate in removing tokens from an
array that initially has nine tokens. The tokens are arranged into three rows with one token in the
first row, three in the second row, and five in the third row (Figure 7.10).

                                 Figure 7.10: Nim Initial Configuration


The first player must remove at least one token but not more than all the tokens in any single row.
Tokens can only be removed from a single row during each player's move. The second player
responds by removing one or more tokens remaining in any row. Players alternate in this way
until all tokens have been removed; the loser is the player forced to remove the last token.

We will use the triple (n1, n2, n3) to represent the state of the game at a given time, where n1, n2
and n3 are the numbers of tokens in rows 1, 2, and 3, respectively. We will also use a matrix to
determine the moves made by the CLA for any given state. The matrix of Figure 7.11 has column
headings which correspond to the state of the game when it is the CLA's turn to move, and row
headings which correspond to the new game state after the CLA has completed a move.
Fractional entries in the matrix are transition probabilities used by the CLA to execute each of its
moves. Asterisks in the matrix represent invalid moves.

Beginning with the initial state (1, 3, 5), suppose the CLA’s opponent removes two tokens from
the third row resulting in the new state (1, 3, 3). If the CLA then removes all three tokens from the
second row, the resultant state is (1, 0, 3). Suppose the opponent now removes all remaining
tokens from the third row. This leaves the CLA with a losing configuration of (1, 0, 0).




                    Figure 7.11: CLA Internal Representation of Game States


At the start of the learning sequence, the matrix is initialized such that the elements in each
column are equal (uniform) probability values. For example, since there are eight valid moves
from the state (1, 3, 4), each column element under this state corresponding to a valid move is
given the initial probability 1/8.

The CLA selects moves probabilistically using the probability values in each column. So, for
example, if the CLA had the first move, any row intersecting with the first column not containing
an asterisk would be chosen with probability 1/9. This choice then determines the new game
state from which the opponent must select a move. The opponent might have a similar matrix to
record game states and choose moves. A complete game is played before the CLA is given any
feedback, at which time it is informed whether its responses were good or bad. This is the
collective feature of the CLA.
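
The valid moves from a state, and the uniform starting column for that state, can be generated with a short Python sketch. The function names are assumptions, and one column of the matrix is represented here as a dictionary from successor states to probabilities.

```python
from fractions import Fraction


def successors(state):
    """All states reachable by removing one or more tokens from a single row."""
    result = []
    for row, count in enumerate(state):
        for take in range(1, count + 1):
            new_state = list(state)
            new_state[row] = count - take
            result.append(tuple(new_state))
    return result


def initial_column(state):
    """Uniform transition probabilities over the valid moves from state."""
    moves = successors(state)
    return {move: Fraction(1, len(moves)) for move in moves}


first_column = initial_column((1, 3, 5))
```

States that do not appear as keys of a column correspond to the asterisks (invalid moves) in the matrix.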

If the CLA wins a game, all moves made by the CLA during that game are rewarded by increasing
the probability value in each column corresponding to the winning move. All non-winning probabilities
in those columns are reduced equally to keep the sum in each column equal to 1. If the CLA loses
a game, the moves leading to that loss are penalized by reducing the probability values
corresponding to each losing move. All other probabilities in the columns having a losing move are
increased equally to keep the column totals equal to 1.

After a number of games have been played by the CLA, the matrix elements that correspond to
repeated wins will increase toward one, while all other elements in the column will decrease
toward zero. Consequently, the CLA will choose the winning moves more frequently and thereby
improve its performance.

Simulated games between a CLA and various types of opponents have been performed and the
results plotted (Bock, 1985). It was shown, for example, that two CLAs playing against each other
required about 300 games before each learned to play optimally. Note, however, that
convergence to optimality can be accomplished with fewer games if the opponent always plays
optimally (or poorly), since, in such a case, the CLA will repeatedly lose (win) and quickly reduce
(increase) the losing (winning) move elements to zero (one). It is also possible to speed up the
learning process through the use of other techniques such as learned heuristics.

Learning systems based on the learning automaton or CLA paradigm are fairly general for
applications in which a suitable state representation scheme can be found. They are also quite
robust learners. In fact, it has been shown that an LA will converge to an optimal distribution
under fairly general conditions if the feedback is accurate with probability greater than 0.5 (Narendra
and Thathachar, 1974). Of course, the rate of convergence is strongly dependent on the reliability
of the feedback.

Learning automata are not very efficient learners as was noted in the game-playing example
above. They are, however, relatively easy to implement, provided the number of states is not too
large. When the number of states becomes large, the amount of storage and the computation
required to update the transition matrix become excessive.

Potential applications for learning automata include adaptive telephone routing and control. Such
applications have been studied using simulation programs (Narendra et al., 1977).

Genetic Algorithms

Genetic algorithm learning methods are based on models of natural adaptation and evolution.
These learning systems improve their performance through processes that model population
genetics and survival of the fittest. They have been studied since the early 1960s (Holland, 1962,
1975).

In the field of genetics, a population is subjected to an environment that places demands on the
members. The members that adapt well are selected for mating and reproduction. The offspring
of these better performers inherit genetic traits from both their parents. Members of this second
generation of offspring, which also adapt well, are then selected for mating and reproduction and
the evolutionary cycle continues. Poor performers die off without leaving offspring. Good
performers produce good offspring and they, in turn, perform well. After some number of
generations, the resultant population will have adapted optimally or at least very well to the
environment.

Genetic algorithm systems start with a fixed size population of data structures that are used to
perform some given tasks. After requiring the structures to execute the specified tasks some
number of times, the structures are rated on their performance, and a new generation of data
structures is then created. Mating the higher performing structures to produce offspring creates
the new generation. These offspring and their parents are then retained for the next generation
while the poorer performing structures are discarded. The basic cycle is illustrated in Figure 7.12.

Mutations are also performed on the best performing structures to ensure that the full space of
possible structures is reachable. This process is repeated for a number of generations until the
resultant population consists of only the highest performing structures.

Data structures that make up the population can represent rules or any other suitable types of
knowledge structure. To illustrate the genetic aspects of the problem, assume for simplicity that
the populations of structures are fixed-length binary strings such as the eight-bit string 11010001.
An initial population of these eight-bit strings would be generated randomly or with the use of
heuristics at time zero. These strings, which might be simple condition-action rules, would then
be assigned some tasks to perform (like predicting the weather based on certain physical and
geographic conditions or diagnosing a fault in a piece of equipment).




                                    Figure 7.12: Genetic Algorithm


After multiple attempts at executing the tasks, each of the participating structures would be rated
and tagged with a utility value, u, commensurate with its performance. The next population would
then be generated using the higher performing structures as parents and the process would be
repeated with the newly produced generation. After many generations the remaining population
structures should perform the desired tasks well.

Mating between two strings is accomplished with the crossover operation which randomly selects
a bit position in the eight-bit string and concatenates the head of one parent to the tail of the
second parent to produce the offspring. Suppose the two parents are designated as xxxxxxxx and
yyyyyyyy respectively, and suppose the third bit position has been selected as the crossover point
(at the position of the colon in the structure xxx: xxxxx). After the crossover operation is applied,
two offspring are then generated, namely xxxyyyyy and yyyxxxxx. Such offspring and their
parents are then used to make up the next generation of structures.

A second genetic operation often used is called inversion. Inversion is a transformation applied to
a single string. A bit position is selected at random, and the inversion operation concatenates the
tail of the string to the head of the same string. Thus, if the sixth position were selected in the
string x1 x2 x3 x4 x5 x6 x7 x8, the inverted string would be x7 x8 x1 x2 x3 x4 x5 x6.

A third operator, mutation, is used to ensure that all locations of the rule space are reachable, that
is, that every potential rule in the rule space is available for evaluation. This ensures that the
selection process does not get caught in a local minimum. For example, it may happen that use of the
crossover and inversion operators will only produce a set of structures that are better than all
local neighbors but not optimal in a global sense. This can happen since crossover and inversion
may not be able to produce some undiscovered structures. The mutation operators can overcome
this by simply selecting any bit position in a string at random and changing it. This operator is
typically used only infrequently to prevent random wandering in the search space.
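
The three operators can be sketched in Python on fixed-length strings. The function names and the convention of passing the cut point explicitly (so the examples are reproducible) are assumptions made for illustration; in a running system the point would be drawn at random, as in the hypothetical random_point helper.

```python
import random


def crossover(parent_a, parent_b, point):
    """Join the head of each parent (first `point` symbols) to the tail of the other."""
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])


def inversion(string, point):
    """Move the tail that follows position `point` in front of the head."""
    return string[point:] + string[:point]


def mutation(string, position):
    """Flip the single bit at the 0-based index `position`."""
    flipped = "1" if string[position] == "0" else "0"
    return string[:position] + flipped + string[position + 1:]


def random_point(length, rng=random):
    """Pick a cut point that leaves a non-empty head and a non-empty tail."""
    return rng.randint(1, length - 1)


children = crossover("xxxxxxxx", "yyyyyyyy", 3)   # ('xxxyyyyy', 'yyyxxxxx')
inverted = inversion("12345678", 6)               # '78123456'
mutated = mutation("11010001", 0)                 # '01010001'
```

The crossover and inversion calls reproduce the two worked examples from the text, with the digits 1 to 8 standing in for x1 to x8.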

The genetic paradigm is best understood through an example. To illustrate similarities between
the learning automaton paradigm and the genetic paradigm we use the same learning task of the
previous section, namely learning to play the game of Nim optimally. We use a slightly different
representation scheme here since we want a population of structures that are easily transformed.
To do this, we let each member of the population consist of a pair of triples augmented with a
utility value u, ((n1, n2, n3) (m1, m2, m3) u), where the first triple is the game state presented to the
genetic algorithm system prior to its move, and the second triple is the state after the move. The u
values represent the worth or current utility of the structure at any given time.

Before the game begins, the genetic system randomly generates an initial population of K triple-
pair members. The population size K is one of the important parameters that must be selected.
Here, we simply assume it is about 25 or 30, which should be more than the number of moves
needed for any optimal play. All members are assigned an initial utility value of 0. The learning
process then proceeds as follows:

1.   The environment presents a valid triple to the genetic system.

2.   The genetic system searches for all population triple pairs that have a first triple that matches
     the input triple. From those that match, the first one found having the highest utility value u is
     selected and the second triple is returned as the new game state. If no match is found, the
      genetic system randomly generates a triple which represents a valid move, returns this as
      the new state, and stores the triple pair (the input triple and the newly generated triple) as
      an addition to the population.

3.    The above two steps are repeated until the game is terminated, in which case the genetic
      system is informed whether a win or loss occurred. If the system wins, each of the
      participating member moves has its utility value increased. If the system loses, each
      participating member has its utility value decreased.

4.    The above steps are repeated until a fixed number of games have been played. At this time
      a new generation is created.
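
Steps 2 and 3 above can be sketched in Python as follows. This is a hedged illustration, not the original system: the dictionary layout with keys before, after and u, the function names, the utility step of 1, and the usual Nim rule (a move removes one or more objects from a single pile) are all assumptions.

```python
import random


def random_move(state, rng=random):
    """Return a valid successor: remove 1..n objects from one non-empty pile.

    Assumes the game is not over, i.e. at least one pile is non-empty.
    """
    pile = rng.choice([i for i, n in enumerate(state) if n > 0])
    take = rng.randint(1, state[pile])
    successor = list(state)
    successor[pile] -= take
    return tuple(successor)


def select_move(population, state, rng=random):
    """Step 2: pick the first highest-utility member whose first triple matches."""
    matches = [m for m in population if m["before"] == state]
    if matches:
        return max(matches, key=lambda m: m["u"])   # max keeps the first of ties
    member = {"before": state, "after": random_move(state, rng), "u": 0}
    population.append(member)                        # remember the new pair
    return member


def credit(used_members, won, step=1):
    """Step 3: raise or lower the utility of every member used in the game."""
    for member in used_members:
        member["u"] += step if won else -step


population = [
    {"before": (3, 2, 1), "after": (3, 2, 0), "u": 1},
    {"before": (3, 2, 1), "after": (1, 2, 1), "u": 4},
]
chosen = select_move(population, (3, 2, 1))                     # matches, takes u = 4
fresh = select_move(population, (2, 2, 0), random.Random(0))    # no match, new member
credit([chosen, fresh], won=True)
```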

The new generation is created from the old population by first selecting a fraction (say one half) of
the members having the highest utility values. From these, offspring are obtained by application
of appropriate genetic operators.

The three operators, crossover, inversion, and mutation, randomly modify the parent moves to
give new offspring move sequences. (The best choice of genetic operators to apply in this
example is left as an exercise). Each offspring inherits a utility value from one of the parents.
Population members having low utility values are discarded to keep the population size fixed.

This whole process is repeated until the genetic system has learned all the optimal moves. This
can be determined when the parent population ceases to change or when the genetic system
repeatedly wins.

The similarity between the learning automaton and genetic paradigms should be apparent from
this example. Both rely on random move sequences, and the better moves are rewarded while
the poorer ones are penalized.




Neural Network

Neural networks are large networks of simple processing elements or nodes that process
information dynamically in response to external inputs. The nodes are simplified models of
neurons. The knowledge in a neural network is distributed throughout the network in the form of
internode connections and weighted links that form the inputs to the nodes. The link weights
serve to enhance or inhibit the input stimuli values that are then added together at the nodes. If
the sum of all the inputs to a node exceeds some threshold value T, the node executes and
produces an output that is passed on to other nodes or is used to produce some output response.
In the simplest case, no output is produced if the total input is less than T. In more complex
models, the output will depend on a nonlinear activation function.

Neural networks were originally conceived as models of the human nervous system. They are
greatly simplified models to be sure (neurons are known to be fairly complex processors). Even
so, they have been shown to exhibit many "intelligent" abilities, such as learning, generalization,
and abstraction.

A single node is illustrated in Figure 7.13. The inputs to the node are the values x1, x2, ..., xn,
which typically take on values of -1, 0, 1, or real values within the range (-1, 1). The weights w1,
w2, ..., wn correspond to the synaptic strengths of a neuron. They serve to increase or decrease
the effects of the corresponding xi input values. The sum of the products xi × wi, i = 1, 2, ..., n,
serves as the total combined input to the node. If this sum is large enough to exceed the threshold
amount T, the node fires, and produces an output y, an activation function value placed on the
node’s output links. The output may then be the input to other nodes or the final output response
from the network.
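
A minimal Python sketch of such a threshold node is given below. The function name, the example weights, and the choice to represent "no output" as 0 are assumptions for illustration only.

```python
def node_output(inputs, weights, threshold):
    """Fire (return 1) when the weighted input sum exceeds the threshold T."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0


example_weights = [0.5, -0.75, 2.0]
fires = node_output([1, -1, 0], example_weights, threshold=1.0)    # sum = 1.25
silent = node_output([1, 0, 0], example_weights, threshold=1.0)    # sum = 0.5
```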




                              Figure 7.13: Model of a single neuron (node)


Figure 7.14 illustrates three layers of a number of interconnected nodes. The first layer serves as
the input layer, receiving inputs from some set of stimuli. The second layer (called the hidden
layer) receives inputs from the first layer and produces a pattern of inputs to the third layer,
the output layer. The pattern of outputs from the final layer is the network's response to the
input stimuli patterns. Input links to layer j (j = 1, 2, 3) have weights wij for i = 1, 2, ..., n.

General multilayer networks having n nodes (number of rows) in each of m layers (number of
columns of nodes) will have weights represented as an n x m matrix W. Using this representation,
nodes having no interconnecting links will have a weight value of zero. Networks consisting of
more than three layers would, of course, be correspondingly more complex than the network
depicted in Figure 7.14.

A neural network can be thought of as a black box that transforms the input vector x to the output
vector y where the transformation performed is the result of the pattern of connections and
weights, that is, according to the values of the weight matrix W.

Consider the vector product
                    $$ \mathbf{x} \cdot \mathbf{w} = \sum_{i} x_i w_i $$

There is a geometric interpretation for this product. It is equivalent to projecting one vector onto
the other vector in n-dimensional space. This notion is depicted in Figure 7.15 for the two-
dimensional case. The magnitude of the resultant vector is given by
                    $$ \mathbf{x} \cdot \mathbf{w} = |\mathbf{x}| \, |\mathbf{w}| \cos\theta $$

where |x| denotes the norm or length of the vector x. Note that this product is maximum when both
vectors point in the same direction, that is, when θ = 0. The product is a minimum when both
point in opposite directions, that is, when θ = 180 degrees. This illustrates how the vectors in the
weight matrix W influence the inputs to the nodes in a neural network.
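
The relation between the product and the angle can be checked numerically. The short Python sketch below uses hypothetical two-dimensional vectors chosen so that the lengths are whole numbers; the helper names are assumptions.

```python
import math


def dot(x, w):
    """Sum of the products x_i * w_i."""
    return sum(xi * wi for xi, wi in zip(x, w))


def norm(x):
    """Length |x| of the vector x."""
    return math.sqrt(dot(x, x))


def cos_angle(x, w):
    """cos θ recovered from x . w = |x| |w| cos θ."""
    return dot(x, w) / (norm(x) * norm(w))


x = [3.0, 4.0]                        # |x| = 5
same = dot(x, [6.0, 8.0])             # same direction, |w| = 10
opposite = dot(x, [-6.0, -8.0])       # opposite direction
orthogonal = dot(x, [-4.0, 3.0])      # right angle
```

With these vectors the product equals |x||w| = 50 when the directions agree, -50 when they are opposite, and 0 at a right angle.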




                              Figure 7.14: A Multilayer Neural Network




                       Figure 7.15: Vector Multiplication is Like Vector Projection




Learning pattern weights

The interconnections and weights W in the neural network store the knowledge possessed by the
network. These weights must be preset or learned in some manner. When learning is used, the
process may be either supervised or unsupervised. In the supervised case, learning is performed
by repeatedly presenting the network with an input pattern and a desired output response. The
training examples then consist of the vector pairs (x, y'), where x is the input pattern and y' is the
desired output response pattern. The weights are then adjusted until the actual output response y
and the desired response y' are the same, that is, until the difference D = y' - y is near
zero.

One of the simpler supervised learning algorithms uses the following formula to adjust the weights
W.
                    $$ W_{new} = W_{old} + a \times D \times \frac{\mathbf{x}}{|\mathbf{x}|^{2}} $$

where 0 < a < 1 is a learning constant that determines the rate of learning. When the difference D
is large, the adjustment to the weights W is large, but when the output response y is close to the
target response y' the adjustment will be small. When the difference D is near zero, the training
process terminates at which point the network will produce the correct response for the given
input patterns x.
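
The update formula can be tried on a single training pair. The Python sketch below makes several assumptions that are not in the text: the node is linear (its response is the weighted sum of its inputs), D is taken as the desired response minus the actual response (the sign under which each step moves the response toward the target), and the learning constant is a = 0.5.

```python
def train_step(w, x, target, a=0.5):
    """One update W_new = W_old + a * D * x / |x|^2 for a linear node."""
    y = sum(wi * xi for wi, xi in zip(w, x))     # actual response
    d = target - y                               # D: desired minus actual (assumed sign)
    norm_sq = sum(xi * xi for xi in x)           # |x|^2
    return [wi + a * d * xi / norm_sq for wi, xi in zip(w, x)]


w = [0.0, 0.0]
x, target = [1.0, 2.0], 1.0
errors = []
for _ in range(10):
    y = sum(wi * xi for wi, xi in zip(w, x))
    errors.append(abs(target - y))               # error before this update
    w = train_step(w, x, target)
```

Under these assumptions every step removes the fraction a of the remaining error, so with a = 0.5 the error halves on each presentation of the pattern.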

In unsupervised learning, the training examples consist of the input vectors x only. No desired
response y' is available to guide the system. Instead, the learning process must find the weights
W with no knowledge of the desired output response.

A neural net expert system

An example of a simple neural network diagnostic expert system has been described by Stephen
Gallant (1988). This system diagnoses and recommends treatments for acute sarcophagal
disease. The system is illustrated in Figure 7.16. From six symptom variables u1, u2, ..., u6, one
of two possible diseases can be diagnosed, u7 or u8. From the resultant diagnosis, one of three
treatments, u9, u10, or u11 can then be recommended.




                        Figure 7.16: A Simple Neural Network Expert System.


When a given symptom is present, the corresponding variable is given a value of +1 (true).
Negative symptoms are given an input value of -1 (false), and unknown symptoms are given the
value 0. Input symptom values are multiplied by their corresponding weights wij. Numbers within
the nodes are initial bias weights wi0 and numbers on the links are the other node input weights.
When the sum of the weighted products of the inputs exceeds 0, an output will be present on the
corresponding node output and serve as an input to the next layer of nodes.

As an example, suppose the patient has swollen feet (u1 = +1) but not red ears (u2 = -1) nor hair
loss (u3 = -1). This gives a value of u7 = +1 (since 0 + (2)(1) + (-2)(-1) + (3)(-1) = 1), suggesting the
patient has superciliosis.

When it is also known that the other symptoms of the patient are false (u5 = u6 = -1), it may be
concluded that namatosis is absent (u8 = -1), and therefore that birambio (u10 = +1) should be
prescribed while placibin should not be prescribed (u9 = -1). In addition, it will be found that
posiboost should also be prescribed (u11 = +1).

The training algorithm added the intermediate triangular shaped nodes. These additional nodes
are needed so that weight assignments can be made which permit the computations to work
correctly for all training instances.

Deductions can be made just as well when only partial information is available. For example,
when a patient has swollen feet and suffers from hair loss, it may be concluded the patient has
superciliosis, regardless of whether or not the patient has red ears. This is so because the
unknown variable cannot force the sum to change to negative.
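
Both deductions can be reproduced with a few lines of Python. The bias 0 and the weights 2, -2 and 3 are read off the worked sum in the text; the function name and the convention of returning -1 for a negative sum and 0 for a zero sum are assumptions made for this sketch.

```python
def node_value(bias, weights, inputs):
    """Return +1, -1 or 0 according to the sign of the weighted sum."""
    total = bias + sum(w * u for w, u in zip(weights, inputs))
    return (total > 0) - (total < 0)


BIAS = 0
WEIGHTS = [2, -2, 3]   # swollen feet, red ears, hair loss

# Swollen feet present, red ears and hair loss absent: 2 + 2 - 3 = 1.
first_case = node_value(BIAS, WEIGHTS, [+1, -1, -1])

# Swollen feet and hair loss present, red ears false, unknown, or true.
partial = [node_value(BIAS, WEIGHTS, [+1, ears, +1]) for ears in (-1, 0, +1)]
```

In the partial-information case the sum is 7, 5 or 3 depending on the value for red ears, so the conclusion stays at +1 whatever that value turns out to be.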

A system such as this can also explain how or why a conclusion was reached. For example,
when inputs and outputs are regarded as rules, an output can be explained as the conclusion to a
rule. If placibin is true, the system might explain why with a statement such as:

Placibin is TRUE due to the following rule:

                     IF Placibin Allergy (u6) is FALSE, and Superciliosis is TRUE

                                  THEN Conclude Placibin is TRUE.

Expert systems based on neural network architectures can be designed to possess many of the
features of other expert system types, including explanations for how and why, and confidence
estimation for variable deduction.

Learning in Neural Networks

The perceptron, an invention of Rosenblatt [1962], was one of the earliest neural network models.
A perceptron models a neuron by taking a weighted sum of its inputs and sending the output 1 if
the sum is greater than some adjustable threshold value (otherwise it sends 0). Figure 7.17
shows the device. Notice that in a perceptron, unlike a Hopfield network, connections are
unidirectional.

The inputs (x1, x2, ..., xn) and connection weights (w1, w2, ..., wn) in the figure are typically real values,
both positive and negative. If the presence of some feature xi tends to cause the perceptron to
fire, the weight wi will be positive; if the feature xi inhibits the perceptron, the weight wi will be
negative. The perceptron itself consists of the weights, the summation processor, and the
adjustable threshold processor. Learning is a process of modifying the values of the weights and
the threshold. It is convenient to implement the threshold as just another weight w0, as in Figure
7.18. This weight can be thought of as the propensity of the perceptron to fire irrespective of its
inputs. The perceptron of Figure 7.18 fires if the weighted sum is greater than zero.

A perceptron computes a binary function of its input. Several perceptrons can be combined to
compute more complex functions, as shown in Figure 7.19.




                                 Figure 7.17: A Neuron and a Perceptron




           Figure 7.18: Perceptron with Adjustable Threshold Implemented as Additional Weight




                     Figure 7.19: A Perceptron with Many Inputs and Many Outputs


Such a group of perceptrons can be trained on sample input-output pairs until it learns to compute
the correct function. The amazing property of perceptron learning is this: Whatever a perceptron
can compute, it can learn to compute. We demonstrate this in a moment. At the time perceptrons
were invented, many people speculated that intelligent systems could be constructed out of
perceptrons (see Figure 7.20).

Since the perceptrons of Figure 7.19 are independent of one another, they can be separately
trained. So let us concentrate on what a single perceptron can learn to do. Consider the pattern
classification problem shown in Figure 7.21. This problem is linearly separable, because we can
draw a line that separates one class from another. Given values for x1 and x2, we want to train a
perceptron to output 1 if it thinks the input belongs to the class of white dots and 0 if it thinks the
input belongs to the class of black dots. We have no explicit rule to guide us; we must induce a
rule from a set of training instances. We now see how perceptrons can learn to solve such
problems.

First, it is necessary to take a close look at what the perceptron computes. Let x be an input vector
(x1, x2, ..., xn). Notice that the weighted summation function g(x) and the output function o(x) can
be defined as:
                    $$ g(x) = \sum_{i=0}^{n} w_i x_i $$
                    $$ o(x) = \begin{cases} 1 & \text{if } g(x) > 0 \\ 0 & \text{if } g(x) < 0 \end{cases} $$

Consider the case where we have only two inputs (as in Figure 7.22). Then:
                    $$ g(x) = w_0 + w_1 x_1 + w_2 x_2 $$




           Figure 7.20: An Early Notion of an Intelligent System Built from Trainable Perceptrons


If g(x) is exactly zero, the perceptron cannot decide whether to fire. A slight change in inputs
could cause the device to go either way. If we solve the equation g(x) = 0, we get the equation for
a line:
                    $$ x_2 = -\frac{w_1}{w_2} x_1 - \frac{w_0}{w_2} $$
The location of the line is completely determined by the weights w0, w1, and w2. If an input vector
lies on one side of the line, the perceptron will output 1; if it lies on the other side, the perceptron
will output 0. A line that correctly separates the training instances corresponds to a perfectly
functioning perceptron. Such a line is called a decision surface. In perceptrons with many inputs,
the decision surface will be a hyperplane through the multidimensional space of possible input
vectors. The problem of learning is one of locating an appropriate decision surface.
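
For the two-input case, the decision line can be computed directly from the weights. In the Python sketch below the weight values are hypothetical and the function names are assumptions; the sketch also assumes w2 is not zero so that the line can be written with x2 on the left.

```python
def g(w, x1, x2):
    """Weighted sum g(x) = w0 + w1*x1 + w2*x2."""
    w0, w1, w2 = w
    return w0 + w1 * x1 + w2 * x2


def decision_line(w):
    """Return (slope, intercept) of the line g(x) = 0, assuming w2 != 0."""
    w0, w1, w2 = w
    return -w1 / w2, -w0 / w2


w = (-1.0, 1.0, 2.0)                                # hypothetical weights
slope, intercept = decision_line(w)                 # x2 = -0.5 * x1 + 0.5
on_line = g(w, 1.0, slope * 1.0 + intercept)        # a point on the line gives g = 0
above = 1 if g(w, 1.0, 1.0) > 0 else 0              # one side of the line: output 1
below = 1 if g(w, 0.0, 0.0) > 0 else 0              # other side: output 0
```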

We present a formal learning algorithm later. For now, consider the informal rule:

      If the perceptron fires when it should not fire, make each wi smaller by an amount
      proportional to xi. If the perceptron fails to fire when it should fire, make each wi larger by a
      similar amount.

Suppose we want to train a three-input perceptron to fire only when its first input is on. If the
perceptron fails to fire in the presence of an active x1, we will increase w1 (and we may increase
other weights). If the perceptron fires incorrectly, we will end up decreasing weights that are not
w1. (We will never decrease w1 because undesired firings only occur when x1 is 0, which forces
the proportional change in w1 also to be 0.) In addition, w0 will find a value based on the total
number of incorrect firings versus incorrect misfirings. Soon, w1 will become large enough to
overpower w0, while w2 and w3 will not be powerful enough to fire the perceptron, even in the
presence of both x2 and x3.




                     Figure 7.21: A Linearly Separable Pattern Classification Problem


Now let us return to the functions g(x) and o(x). While the sign of g(x) is critical to determining
whether the perceptron will fire, the magnitude is also important. The absolute value of g(x) tells
how far a given input vector x lies from the decision surface. This gives us a way of
characterizing how good a set of weights is. Let w be the weight vector (w0, w1, ..., wn), and let X
be the subset of training instances misclassified by the current set of weights. Then define the
perceptron criterion function, J(w), to be the sum of the distances of the misclassified input
vectors from the decision surface:

          $$ J(\mathbf{w}) = \sum_{\mathbf{x} \in X} \left| \sum_{i=0}^{n} w_i x_i \right| = \sum_{\mathbf{x} \in X} |\mathbf{w} \cdot \mathbf{x}| $$

To create a better set of weights than the current set, we would like to reduce J(w). Ultimately, if
all inputs are classified correctly, J(w) = 0.

How do we go about minimizing J(w)? We can use a form of local-search hill climbing known as
gradient descent. For our current purposes, think of J(w) as defining a surface in the space of
all possible weights. Such a surface might look like the one in Figure 7.22.
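
The criterion function can be evaluated directly from its definition as a sum over the misclassified inputs. The Python sketch below is illustrative only: the training examples and weight vectors are hypothetical, each input carries a leading x0 = 1, labels are 1 (should fire) or 0 (should not fire), and a sum of exactly zero is treated as not firing, which is an assumption.

```python
def g(w, x):
    """Weighted sum of the inputs, including the bias term w0 * x0."""
    return sum(wi * xi for wi, xi in zip(w, x))


def criterion(w, examples):
    """Sum |w . x| over the examples that the weights w misclassify."""
    total = 0.0
    for x, label in examples:
        output = 1 if g(w, x) > 0 else 0
        if output != label:
            total += abs(g(w, x))
    return total


examples = [
    ((1, 2.0, 1.0), 1),
    ((1, -1.0, -2.0), 0),
    ((1, 0.5, 3.0), 1),
]
bad = criterion((0.0, -1.0, 0.0), examples)    # misclassifies all three: 2 + 1 + 0.5
good = criterion((0.0, 1.0, 1.0), examples)    # classifies all three correctly
```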

In the figure, weight w0 should be part of the weight space but is omitted here because it is easier
to visualize J in only three dimensions. Now, some of the weight vectors constitute solutions, in
that a perceptron with such a weight vector will classify all its inputs correctly. Note that there are
an infinite number of solution vectors. For any solution vector ws, we know that J(ws) = 0.
Suppose we begin with a random weight vector w that is not a solution vector. We want to slide
down the J surface. There is a mathematical method for doing this: we compute the gradient of
the function J(w). Before we derive the gradient function, we reformulate the perceptron
criterion function to remove the absolute value sign:

          $$ J(\mathbf{w}) = \sum_{\mathbf{x} \in X} \mathbf{w} \cdot \begin{cases} \mathbf{x} & \text{if } \mathbf{x} \text{ is misclassified as a negative example} \\ -\mathbf{x} & \text{if } \mathbf{x} \text{ is misclassified as a positive example} \end{cases} $$




               Figure 7.22: Adjusting the Weights by Gradient Descent, Minimizing J(w)


Recall that X is the set of misclassified input vectors.

Now, here is ∇J, the gradient of J(w) with respect to the weight space:

          $$ \nabla J(\mathbf{w}) = \sum_{\mathbf{x} \in X} \begin{cases} \mathbf{x} & \text{if } \mathbf{x} \text{ is misclassified as a negative example} \\ -\mathbf{x} & \text{if } \mathbf{x} \text{ is misclassified as a positive example} \end{cases} $$

The gradient is a vector that tells us the direction to move in the weight space in order to reduce
J(w). In order to find a solution weight vector, we simply change the weights in the direction of
the gradient, recompute J(w), recompute the new gradient, and iterate until J(w) = 0. The
rule for updating the weights at time t + 1 is:

                    $$ \mathbf{w}_{t+1} = \mathbf{w}_t + \eta \nabla J $$

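Or in expanded form:

          $$ \mathbf{w}_{t+1} = \mathbf{w}_t + \eta \sum_{\mathbf{x} \in X} \begin{cases} \mathbf{x} & \text{if } \mathbf{x} \text{ is misclassified as a negative example} \\ -\mathbf{x} & \text{if } \mathbf{x} \text{ is misclassified as a positive example} \end{cases} $$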

η is a scale factor that tells us how far to move in the direction of the gradient. A small η will
lead to slower learning, but a large η may cause a move through weight space that overshoots
the solution vector. Taking η to be a constant gives us what is usually called the
“fixed-increment perceptron-learning algorithm”:

1.   Create a perceptron with n+1 inputs and n+1 weights, where the extra input x0 is always set
     to 1.

2.   Initialize the weights (w0, w1, ..., wn) to random real values.

3.   Iterate through the training set, collecting all examples misclassified by the current set of
     weights.

4.   If all examples are classified correctly, output the weights and quit.

5.   Otherwise, compute the vector sum S of the misclassified input vectors, where each vector
     has the form (x0, x1, ..., xn). In creating the sum, add to S a vector x if x is an input for
     which the perceptron incorrectly fails to fire, but add vector -x if x is an input for which the
     perceptron incorrectly fires. Multiply the sum by a scale factor η.

6.   Modify the weights (w0, w1, ..., wn) by adding the elements of the vector S to them. Go to step
     3.
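
The six steps above can be sketched in Python as follows. This is a hedged sketch rather than a reference implementation: the function names, the initial weight range [-1, 1], the default η = 1, the cap max_passes, the seeded random generator, and the logical-AND training set are all assumptions introduced for illustration.

```python
import random


def fires(w, x):
    """Output function: the perceptron fires when the weighted sum is positive."""
    return sum(wi * xi for wi, xi in zip(w, x)) > 0


def train_perceptron(examples, eta=1.0, max_passes=1000, rng=None):
    """examples: list of (inputs, label), label 1 (fire) or 0 (do not fire)."""
    rng = rng or random.Random(0)
    n = len(examples[0][0])
    # Step 1: add the extra input x0 = 1 to every example.
    data = [((1.0,) + tuple(x), label) for x, label in examples]
    # Step 2: random initial weights w0..wn.
    w = [rng.uniform(-1, 1) for _ in range(n + 1)]
    for _ in range(max_passes):
        # Step 3: collect the misclassified examples.
        wrong = [(x, label) for x, label in data if fires(w, x) != bool(label)]
        # Step 4: stop when everything is classified correctly.
        if not wrong:
            return w
        # Step 5: add x if the perceptron failed to fire, -x if it fired wrongly.
        s = [0.0] * (n + 1)
        for x, label in wrong:
            sign = 1.0 if label else -1.0
            s = [si + sign * xi for si, xi in zip(s, x)]
        # Step 6: move the weights by eta * S and repeat from step 3.
        w = [wi + eta * si for wi, si in zip(w, s)]
    raise RuntimeError("no separating weights found within max_passes")


# Hypothetical linearly separable task: fire only when both inputs are 1.
and_examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = train_perceptron(and_examples)
```

The cap on the number of passes is a practical guard for data that are not linearly separable, where the loop would otherwise never reach step 4.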

The perceptron-learning algorithm is a search algorithm. It begins in a random initial state and
finds a solution state. The search space is simply all possible assignments of real values to the
weights of the perceptron, and the search strategy is gradient descent.

So far, we have seen two search methods employed by neural networks, gradient descent in
perceptrons and parallel relaxation in Hopfield networks. It is important to understand the relation
between the two. Parallel relaxation is a problem-solving strategy, analogous to state space
search in symbolic AI. Gradient descent is a learning strategy, analogous to techniques such as
version spaces. In both symbolic and connectionist AI, learning is viewed as a type of problem
solving, and this is why search is useful in learning. But the ultimate goal of learning is to get a
system into a position where it can solve problems better. Do not confuse learning algorithms with
problem-solving algorithms.

The perceptron convergence theorem, due to Rosenblatt [1962], guarantees that the perceptron
will find a solution state, i.e., it will learn to classify any linearly separable set of inputs. In other
words, the theorem shows that in the weight space, there are no local minima that do not
correspond to the global minimum. Figure 7.23 shows a perceptron learning to classify the
instances of Figure 7.21. Remember that every set of weights specifies some decision surface, in
this case some two-dimensional line. In the figure, k is the number of passes through the training
data, i.e., the number of iterations of steps 3 through 6 of the fixed-increment perceptron-learning
algorithm.

The introduction of perceptrons in the late 1950s created a great deal of excitement: here was a
device that strongly resembled a neuron and for which well-defined learning algorithms were
available. There was much speculation about how intelligent systems could be constructed from
perceptron building blocks. In their book Perceptrons, Minsky and Papert put an end to such
speculation by analyzing the computational capabilities of the devices. They noticed that while the
convergence theorem guaranteed correct classification of linearly separable data, most problems
do not supply such nice data. Indeed, the perceptron is incapable of learning to solve some very
simple problems.
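A standard illustration, not named in the passage above, is the exclusive-or of two binary inputs. The short derivation below assumes the perceptron fires exactly when w0 + w1 x1 + w2 x2 > 0, and shows that no choice of weights can fire on (0,1) and (1,0) while staying silent on (0,0) and (1,1).

```latex
\begin{aligned}
(x_1, x_2) = (0,0),\ \text{no fire}: &\quad w_0 \le 0 \\
(x_1, x_2) = (0,1),\ \text{fire}:    &\quad w_0 + w_2 > 0 \\
(x_1, x_2) = (1,0),\ \text{fire}:    &\quad w_0 + w_1 > 0 \\
(x_1, x_2) = (1,1),\ \text{no fire}: &\quad w_0 + w_1 + w_2 \le 0 \\[4pt]
\text{Adding the two ``fire'' rows}: &\quad 2w_0 + w_1 + w_2 > 0 \\
\text{Adding the two ``no fire'' rows}: &\quad 2w_0 + w_1 + w_2 \le 0
\end{aligned}
```

The two sums contradict each other, so the four points are not linearly separable and the convergence theorem offers no guarantee for them.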




                   Figure 7.23: A Perceptron Learning to Solve a Classification Problem
Chaptr 7 (final)

  • 1. Concept of Learning Rote Learning Learning by Taking Advice Learning in Problem Solving Learning by Induction Explanation-Based Learning Learning Automation Learning in Neural Networks Unit 7 Learning Learning Objectives After reading this unit you should appreciate the following: • Concept of Learning • Learning Automation • Genetic Algorithm • Learning by Induction • Neural Networks • Learning in Neural Networks • Back Propagation Network Top Concept of Learning One of the most often heard criticisms of AI is that machines cannot be called intelligent until they are able to learn to do new things and to adapt to new situations, rather than simply doing as they are told to do. There can be little question that the ability to adapt to new surroundings and to solve new problems is an important characteristic of intelligent entities. Can we expect to see such abilities in programs? Ada Augusta, one of the earliest philosophers of computing, wrote that The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform.
  • 2. LEARNING 179 Several AI critics have interpreted this remark as saying that computers cannot learn. In fact, it does not say that at all. Nothing prevents us from telling a computer how to interpret its inputs in such a way that its performance gradually improves. Rather than asking in advance whether it is possible for computers to "learn," it is much more enlightening to try to describe exactly what activities we mean when we say "learning" and what mechanisms could be used to enable us to perform those activities. Simon has proposed that learning denotes changes in the system that are adaptive in the sense that they enable the system to do the same task or tasks drawn from the same population more efficiently and more effectively the next time. As thus defined, learning covers a wide range of phenomena. At one end of the spectrum is skill refinement. People get better at many tasks simply by practicing. The more you ride a bicycle or play tennis, the better you get. At the other end of the spectrum lies knowledge acquisition. As we have seen, many AI programs draw heavily on knowledge as their source of power. Knowledge is generally acquired through experience and such acquisition is the focus of this chapter. Knowledge acquisition itself includes many different activities. Simple storing of computed information, or rote learning, is the most basic learning activity. Many computer programs, e.g., database systems, can be said to "learn" in this sense, although most people would not call such simple storage, learning. However, many AI programs are able to improve their performance substantially through rote-learning technique and we will look at one example in depth, the checker-playing program of Samuel. Another way we learn is through taking advice from others. Advice taking is similar to rote learning, but high-level advice may not be in a form simple enough for a program to use directly in problem solving. 
The advice may need to be first operationalized. People also learn through their own problem-solving experience. After solving a Complex problem, we remember the structure of the problem and the methods we used to solve it. The next time we see the problem, we can solve it more efficiently. Moreover, we can generalize from our experience to solve related problems more easily contrast to advice taking, learning from problem-solving experience does not usually involve gathering new knowledge that was previously unavailable to the learning program. That is, the program remembers its experiences and generalizes from them, but does not add to the transitive closure of its knowledge, in the sense that an advice-taking program would, i.e., by receiving stimuli from the outside world. In
  • 3. 180 ARTIFICIAL INTELLIGENCE large problem spaces, however, efficiency gains are critical. Practically speaking, learning can mean the difference between solving a problem rapidly and not solving it at all. In addition, programs that learn though problem-solving experience may be able to come up with qualitatively better solutions in the future. Another form of learning that does involve stimuli from the outside is learning from examples. We often learn to classify things in the world without being given explicit rules. For example, adults can differentiate between cats and dogs, but small children often cannot. Somewhere along the line, we induce a method for telling cats from dogs - based on seeing numerous examples of each. Learning from examples usually involves a teacher who helps us classify things by correcting us when we are wrong. Sometimes, however, a program can discover things without the aid of a teacher. AI researchers have proposed many mechanisms for doing the kinds of learning described above. In this chapter, we discuss several of them. But keep in mind throughout this discussion that learning is itself a problem-solving process. In fact, it is very difficult to formulate a precise definition of learning that distinguishes it from other problem-solving tasks. Top Rote Learning When a computer stores a piece of data, it is performing a rudimentary form of learning. After all, this act of storage presumably allows the program to perform better in the future (otherwise, why bother?). In the case of data caching, we store computed values so that we do not have to recompute them later. When computation is more expensive than recall, this strategy can save a significant amount of time. Caching has been used in AI programs to produce some surprising performance improvements. Such caching is known as rote learning.
  • 4. LEARNING 181 Figure 7.1: Storing Backed-Up Values We mentioned one of the earliest game-playing programs, Samuel's checkers program. This program learned to play checkers well enough to beat its creator. It exploited two kinds of learning: rote learning and parameter (or coefficient) adjustment. Samuel's program used the minimax search procedure to explore checkers game trees. As is the case with all such programs, time constraints permitted it to search only a few levels in the tree. (The exact number varied depending on the situation.) When it could search no deeper, it applied its static evaluation function to the board position and used that score to continue its search of the game tree. When it finished searching the tree and propagating the values backward, it had a score for the position represented by the root of the tree. It could then choose the best move and make it. But it also recorded the board position at the root of the tree and the backed up score that had just been computed for it as shown in Figure 7.1(a). Now consider a situation as shown in Figure 7.1(b). Instead of using the static evaluation function to compute a score for position A, the stored value for A can be used. This creates the effect of
  • 5. 182 ARTIFICIAL INTELLIGENCE having searched an additional several ply since the stored value for A was computed by backing up values from exactly such a search. Rote learning of this sort is very simple. It does not appear to involve any sophisticated problem- solving capabilities. But even it shows the need for some capabilities that will become increasingly important in more complex learning systems. These capabilities include: 1. Organized Storage of Information-in order for it to be faster to use a stored value than it would to recompute it, there must be a way to access the appropriate stored value quickly. In Samuel's program, this was done by indexing board positions by a few important characteristics, such as the number of pieces. But as the complexity of the stored information increases, more sophisticated techniques are necessary. 2. Generalization-The number of distinct objects that might potentially be stored can be very large. To keep the number of stored objects down to a manageable level, some kind of generalization is necessary. In Samuel's program, for example, the number of distinct objects that could be stored was equal to the number of different board positions that can arise in a game. Only a few simple forms of generalization were used in Samuel's program to cut down that number. All positions are stored as though White is to move. This cuts the number of stored positions in half. When possible, rotations along the diagonal are also combined. Again, though, as the complexity of the learning process increases, so too does the need for generalization. . At this point, we have begun to see one way in which learning is similar to other kinds of problem solving. Its success depends on a good organizational structure for its knowledge base. Student Activity 7.1 Before reading next section, answer the following questions. 1. Discuss role of A.I. in learning. 2. What do you mean by skill refinement and knowledge acquisition? 3. 
Would it be reasonable to apply rote learning procedure to chess? Why? If your answers are correct, then proceed to the next section. Top Learning by Taking Advice
  • 6. LEARNING 183 A computer can do very little without a program for it to run. When a programmer writes a series of instructions into a computer, a rudimentary kind of learning is taking place. The programmer is a sort of teacher, and the computer is a sort of student. After being programmed, the computer is now able to do something it previously could not. Executing the program may not be such a simple matter, however. Suppose the program is written in a high-level language like LISP. Some interpreter or compiler must intervene to change the teacher's instructions into code that the machine can execute directly. People process advice in an analogous way. In chess, the advice "fight for control of the center of the board" is useless unless the player can translate the advice into concrete moves and plans. A computer program might make use of the advice by adjusting its static evaluation function to include a factor based on the number of center squares attacked by its own pieces. Mostow describes a program called FOO, which accepts advice for playing hearts, a card game. A human user first translates the advice from English into a representation that FOO can understand. For example, "Avoid taking points" becomes: (avoid (take-points me) (trick)) FOO must operationalize this advice by turning it into an expression that contains concepts and actions FOO can use when playing the game of hearts. One strategy FOO can follow is to UNFOLD an expression by replacing some term by its definition. By UNFOLDing the definition of avoid, FOO comes up with: (achieve (not (during (trick) (take-points me)))) FOO considers the advice to apply to the player called "me." Next, FOO UNFOLDs the definition of trick: (achieve (not (during (scenario (each pI (players) (play-card p1)) (take-trick (trick-winner)) (take-points me)))) In other words, the player should avoid taking points during the scenario consisting of (1) players playing cards and (2) one player taking the trick. 
FOO then uses case analysis to determine which
  • 7. 184 ARTIFICIAL INTELLIGENCE steps could cause one to take points. It rules out step 1 on the basis that it knows of no intersection of the concepts take-points and play-card. But step 2 could affect taking points, so FOO UNFOLDs the definition of take-points: (achieve (not (there-exists c1 (cards-played) (there-exists c2 (point-cards) (during (take (trick-winner) cl) (take me c2)))))) This advice says that the player should avoid taking point-cards during the process of the trick- winner taking the trick. The question for FOO now is: Under what conditions does (take me c2) occur during (take (trick-winner) cl)? By using a technique called partial match, FOO hypothesizes that points will be taken if me = trick-winner and c2 = cl. It transforms the advice into: (achieve (not (and (have-points (cards-played > = (trick-winner) me)))) This means, "Do not win a trick that has points." We have not travelled very far conceptually from "avoid taking points," but it is important to note that the current vocabulary is one that FOO can understand in terms of actually playing the game of t. hearts. Through a number of other transformations, FOO eventually settles on: (achieve (>= (and (in-suit-led (card-of me)) (possible (trick-has-points))) (low (card-of me))) In other words, when playing a card that is the same suit as the card that was played first, if the trick possibly contains points, then playa low card. At last, FOO has translated the rather vague advice "avoid taking points" into a specific, usable heuristic. FOO is able to playa better game of hearts after receiving this advice. A human can watch FOO play, detect new mistakes, and correct them through yet more advice, such as "play high cards when it is safe to do so." The ability to operationalize knowledge is critical for systems that learn from a teacher's advice. It is also an important component of explanation-based learning. Top Learning in Problem Solving
  • 8. LEARNING 185 In the last section, we saw how a problem solver could improve its performance by taking advice from a teacher. Can a program get better without the aid of a teacher? It can, by generalizing from its own experiences. Learning by Parameter Adjustment Many programs rely on an evaluation procedure that combines information from several sources into a single summary statistic. Game-playing programs do this in their static evaluation functions, in which a variety of factors, such as piece advantage and mobility, are combined into a single score reflecting the desirability of a particular board position. Pattern classification programs often combine several features to determine the correct category into which a given stimulus should be placed. In designing such programs, it is often difficult to know a priori how much weight should be attached to each feature being used. One way of finding the correct weights is to begin with some estimate of the correct settings and then to let the program modify the settings on the basis of its experience. Features that appear to be good predictors of overall success will have their weights increased, while those that do not will have their weights decreased, perhaps even to the point of being dropped entirely. Samuel's checkers program exploited this kind of learning in addition to the rote learning described above, and it provides a good example of its use. As its static evaluation function, the program used a polynomial of the form C1tl + C2t2 + ...+ C16t16 The t terms are the values of the sixteen features that contribute to the evaluation. The terms are the coefficients (weights) that are attached to each of these values. As learning progresses, the values will change. The most important question in the design of a learning program based on parameter adjustment is "When should the value of a coefficient be increased and when should it be decreased?" 
The second question to be answered is then "By how much should the value be changed? The simple answer to the first question is that the coefficients of terms that predicted the final outcome accurately should be increased, while the coefficients of poor predictors should be decreased. In some domains, this is easy to do. If a pattern classification program uses its evaluation function to classify an input and it gets the right answer, then all the terms that predicted that answer should have their weights increased. But in game-playing programs, the problem is more difficult. The program does not get any concrete feedback from individual moves. It does not find out for sure
  • 9. 186 ARTIFICIAL INTELLIGENCE until the end of the game whether it has won. But many moves have contributed to that final outcome. Even if the program wins, it may have made some bad moves along the way. The problem of appropriately assigning responsibility to each of the steps that led to a single outcome is known as the credit assignment problem. Samuel's program exploits one technique, albeit imperfect, for solving this problem. Assume that the initial values chosen for the coefficients are good enough that the total evaluation function produces values that are fairly reasonable measures of the correct score even if they are not as accurate as we hope to get them. Then this evaluation function can be used to provide feedback to itself. Move sequences that lead to positions with higher values can be considered good (and the terms in the evaluation function that suggested them can be reinforced). Because of the limitations of this approach, however, Samuel's program did two other things, one of which provided an additional test that progress was being made and the other of which generated additional nudges to keep the process out of a rut. When the program was in learning mode, it played against another copy of itself. Only one of the copies altered its scoring function during the game; the other remained fixed. At the end of the game, if the copy with the modified function won, then the modified function was accepted. Otherwise, the old one was retained. If, however, this happened very many times, then some drastic change was made to the function in an attempt to get the process going in a more profitable direction. Periodically, one term in the scoring function was eliminated and replaced by another. This was possible because, although the program used only sixteen features at anyone time, it actually knew about thirty-eight. 
This replacement differed from the rest of the learning procedure since it created a sudden change in the scoring function rather than a gradual shift in its weights. This process of learning by successive modifications to the weights of terms in a scoring function has many limitations, mostly arising out of its lack of exploitation of any knowledge about the structure of the problem with which it is dealing and the logical relationships among the problem's components. In addition, because the learning procedure is a variety of hill climbing, it suffers from the same difficulties, as do other hill-climbing programs. Parameter adjustment is certainly not a solution to the overall learning problem. But it is often a useful technique, either in situations where very little additional knowledge is available or in programs in which it is combined with more knowledge-intensive methods.
  • 10. LEARNING 187 Learning with Macro-Operators We saw that rote learning was used in the context of a checker-playing program. Similar techniques can be used in more general problem-solving programs. The idea is the same: to avoid expensive recomputation. For example, suppose you are faced with the problem of getting to the downtown post office. Your solution may involve getting in your car, starting it, and driving along a certain route. Substantial planning may go into choosing the appropriate route, but you need not plan about how to go about starting your car. You are free to treat START-CAR as an atomic action, even though it really consists of several actions: sitting down, adjusting the mirror, inserting the key, and turning the key. Sequences of actions that can be treated as a whole are called macro-operators. Macro-operators were used in the early problem-solving system STRIPS. After each problem- solving episode, the learning component takes the computed plan and stores it away as a macro- operator, or MACROP. A MACROP is just like a regular operator except that it consists of a sequence of actions. Not just a single one. A MACROP's preconditions are the initial conditions of the problem just solved and its postconditions correspond to the goal just achieved. In its simplest form, the caching of previously computed plans is similar to rote learning. Suppose we are given an initial blocks world situation in which ON(C, B) and ON(A, Table) are both true. STRIPS can achieve the goal ON(A, B) by devising a plan with the four steps UNSTACK(C, B). PUTDOWN(C), PICKUP (A), STACK(A, B). A STRIP now builds a MACROP with preconditions ON(C, B). ON(A, Table) and postconditions ON(C, Table). ON(A, B). The body of the MACROP consists of the four steps just mentioned. In future planning, STRIPS is free to use this complex macro-operator just as it would use any other operator. But rarely will STRIPS see the exact same problem twice. 
New problems will differ from previous problems. We would still like the problem solver to make efficient use of the knowledge it gained from its previous experiences. By generalizing MACROPs before storing them, STRIPS is able to accomplish this. The simplest idea for generalization is to replace all of the constants in the macro-operator by variables. Instead of storing the MACROP described in the previous paragraph, STRIPS can generalize the plan to consist of the steps UNSTACK(X1.X2). PUTDOWN(X1). PICKUP(X3), STACK(X3, X2). where X1. X2. and X3 are variables. This plan can then be stored with preconditions ON(X1, X2), ON(X3. Table) and postconditions ON(X1, Table). ON(X2, X3). Such a MACROP can now apply in a variety of situations.
  • 11. 188 ARTIFICIAL INTELLIGENCE Generalization is not so easy. Sometimes constants must retain their specific values. Suppose our domain included an operator called STACK-ON-B(X). With preconditions that both X and B be clear, and with postcondition ON(X, B). STRIPS might come up with the plan UNSTACK(C, B), PUTDOWN(C), STACK- ~ ON-B(A). Let's generalize this plan and store it as a MACROP. The precondition " becomes ON(X3, X2), the postcondition becomes ON(X1, X2), and the plan itself becomes UNSTACK(X3, X2), PUTDOWN(X3), STACK-ON-B(X(). Now, suppose we encounter a slightly different problem: The generalized MACROP we just stored seems well-suited to solving this problem if we let X1 = A, X2 = C, and X3 = E. Its preconditions are satisfied, so we construct the plan UNSTACK(E, C), PUTDOWN(E), STACK-ON-B(A). But this plan does not work. The problem is that the post condition of the MACROP is over generalized. This operation is only useful for stacking blocks onto B, which is not what we need in this new example. In this case, this difficulty will be discovered when the last step is attempted. Although we cleared C, which is where we wanted to put A, we failed to clear B, which is where the MACROP is going to try to put it. Since B is not clear, STACK-ON-B cannot be executed, If B had happened to be clear, the MACROP would have executed to completion, but it would not have accomplished the stated goal. In reality, STRIPS uses a more complex generalization procedure. First, all constants are replaced by variables. Then, for each operator in the parameterized plan, STRIPS reevaluates its preconditions. In our example, the preconditions of steps I and 2 are satisfied, but the only way to ensure that B is clear for step 3 is to assume that block X2, which was cleared by the UNSTACK operator, is actually block B. Through "re- proving" that the generalized plan works, STRIPS locates constraints of this kind. 
It turns out that the set of problems for which macro-operators are critical is exactly the set of problems with nonserializable subgoals. Nonserializability means that working on one subgoal will necessarily interfere with the previous solution to another subgoal. Recall that we discussed such problems in connection with nonlinear planning. Macro-operators can be useful in such cases, since one macro-operator can produce a small global change in the world, even though the individual operators that make it up produce many undesirable local changes. For example, consider the 8-puzzle. Once a program has correctly placed the first four tiles, it is difficult to place the fifth tile without disturbing the first four. Because disturbing previously solved subgoals is detected as a bad thing by heuristic scoring functions, it is strongly resisted. For
many problems, including the 8-puzzle and Rubik's cube, weak methods based on heuristic scoring are therefore insufficient. Hence, we either need domain-specific knowledge, or else a new weak method. Fortunately, we can learn the domain-specific knowledge we need in the form of macro-operators. Thus, macro-operators can be viewed as a weak method for learning. In the 8-puzzle, for example, we might have a macro (a complex, prestored sequence of operators) for placing the fifth tile without disturbing any of the first four tiles externally (although in fact they are disturbed within the macro itself). This approach contrasts with STRIPS, which learned its MACROPs gradually, from experience. Korf's algorithm runs in time proportional to the time it takes to solve a single problem without macro-operators.

Learning by Chunking

Chunking is a process similar in flavor to macro-operators. The idea of chunking comes from the psychological literature on memory and problem solving. SOAR also exploits chunking so that its performance can increase with experience. In fact, the designers of SOAR hypothesize that chunking is a universal learning method, i.e., it can account for all types of learning in intelligent systems.

SOAR solves problems by firing productions, which are stored in long-term memory. Some of those firings turn out to be more useful than others. When SOAR detects a useful sequence of production firings, it creates a chunk, which is essentially a large production that does the work of an entire sequence of smaller ones. As in MACROPs, chunks are generalized before they are stored.

Problems like choosing which subgoals to tackle and which operators to try (i.e., search control problems) are solved with the same mechanisms as problems in the original problem space. Because the problem solving is uniform, chunking can be used to learn general search control knowledge in addition to operator sequences.
For example, if SOAR tries several different operators, but only one leads to a useful path in the search space, then SOAR builds productions that help it choose operators more wisely in the future. SOAR has used chunking to replicate the macro-operator results described in the last section. In solving the 8-puzzle, for example, SOAR learns how to place a given tile without permanently disturbing the previously placed tiles. Given the way that SOAR learns, several chunks may encode a single macro-operator, and one chunk may participate in a number of macro sequences. Chunks are generally applicable toward any goal state. This contrasts with macro tables, which are structured towards reaching a particular goal state from any initial state. Also, chunking emphasizes how learning can occur during
problem solving, while macro tables are usually built during a preprocessing stage. As a result, SOAR is able to learn within trials as well as across trials. Chunks learned during the initial stages of solving a problem are applicable in the later stages of the same problem-solving episode. After a solution is found, the chunks remain in memory, ready for use in the next problem. The price that SOAR pays for this generality and flexibility is speed. At present, chunking is inadequate for duplicating the contents of large, directly computed macro-operator tables.

The Utility Problem

PRODIGY employs several learning mechanisms. One mechanism uses explanation-based learning (EBL), a learning method we discuss later in this unit. PRODIGY can examine a trace of its own problem-solving behavior and try to explain why certain paths failed. The program uses those explanations to formulate control rules that help the problem solver avoid those paths in the future. So while SOAR learns primarily from examples of successful problem solving, PRODIGY also learns from its failures.

A major contribution of the work on EBL in PRODIGY was the identification of the utility problem in learning systems. While new search control knowledge can be of great benefit in solving future problems efficiently, there are also some drawbacks. The learned control rules can take up large amounts of memory, and the search program must take the time to consider each rule at each step during problem solving. Considering a control rule amounts to seeing if its postconditions are desirable and seeing if its preconditions are satisfied. This is a time-consuming process. So while learned rules may reduce problem-solving time by directing the search more carefully, they may also increase problem-solving time by forcing the problem solver to consider them. If we only want to minimize the number of node expansions in the search space, then the more control rules we learn, the better.
But if we want to minimize the total CPU time required to solve a problem, we must consider this trade-off. PRODIGY maintains a utility measure for each control rule. This measure takes into account the average savings provided by the rule, the frequency of its application, and the cost of matching it. If a proposed rule has a negative utility, it is discarded (or "forgotten"). If not, it is placed in long-term memory with the other rules. It is then monitored during subsequent problem solving. If its utility falls, the rule is discarded. Empirical experiments have demonstrated the effectiveness of keeping only those control rules with high utility. Utility considerations apply to a wide range of AI
learning systems. For example, SOAR faces the question of how to deal with large, expensive chunks.

Learning by Induction

Classification is the process of assigning, to a particular input, the name of a class to which it belongs. The classes from which the classification procedure can choose can be described in a variety of ways. Their definition will depend on the use to which they will be put.

Classification is an important component of many problem-solving tasks. In its simplest form, it is presented as a straightforward recognition task. An example of this is the question "What letter of the alphabet is this?" But often classification is embedded inside another operation. To see how this can happen, consider a problem-solving system that contains the following production rule:

If: the current goal is to get from place A to place B, and there is a WALL separating the two places
then: look for a DOORWAY in the WALL and go through it.

To use this rule successfully, the system's matching routine must be able to identify an object as a wall. Without this, the rule can never be invoked. Then, to apply the rule, the system must be able to recognize a doorway.

Before classification can be done, the classes it will use must be defined. This can be done in a variety of ways, including:

  • Isolate a set of features that are relevant to the task domain. Define each class by a weighted sum of values of these features. Each class is then defined by a scoring function that looks very similar to the scoring functions often used in other situations, such as game playing. Such a function has the form

    c1t1 + c2t2 + c3t3 + ...

    Each t corresponds to a value of a relevant parameter, and each c represents the weight to be attached to the corresponding t. Negative weights can be used to indicate features whose presence usually constitutes negative evidence for a given class.
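A scoring function of this form is easy to sketch. The feature names, the weights, and the two letter classes below are hypothetical and chosen only to illustrate the weighted sum; they are not taken from any particular classifier.

```python
def score(weights, features):
    """Weighted sum c1*t1 + c2*t2 + ... over the named features."""
    return sum(c * features.get(name, 0.0) for name, c in weights.items())


def classify(class_weights, features):
    """Return the class whose scoring function gives the highest value."""
    return max(class_weights, key=lambda cls: score(class_weights[cls], features))


# Hypothetical weights for telling the letter "O" from the letter "L".
# A negative weight marks a feature that counts as evidence against the class.
CLASS_WEIGHTS = {
    "O": {"closed_loops": 2.0, "straight_strokes": -1.0},
    "L": {"closed_loops": -2.0, "straight_strokes": 1.0},
}

letter = classify(CLASS_WEIGHTS, {"closed_loops": 1, "straight_strokes": 0})
```

Learning in this scheme amounts to adjusting the coefficients, which is why the class definitions can be tuned without changing the structure of the classifier.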
    For example, if the task is weather prediction, the parameters can be such measurements as rainfall and location of cold fronts. Different functions can be written to combine these parameters to predict sunny, cloudy, rainy, or snowy weather.

  • Isolate a set of features that are relevant to the task domain. Define each class as a structure composed of those features. For example, if the task is to identify animals, the body of each type of animal can be stored as a structure, with various features representing such things as color, length of neck, and feathers.

There are advantages and disadvantages to each of these general approaches. The statistical approach taken by the first scheme presented here is often more efficient than the structural approach taken by the second. But the second is more flexible and more extensible.

Regardless of the way that classes are to be described, it is often difficult to construct, by hand, good class definitions. This is particularly true in domains that are not well understood or that change rapidly. Thus the idea of producing a classification program that can evolve its own class definitions is appealing. This task of constructing class definitions is called concept learning, or induction. The techniques used for this task must, of course, depend on the way that classes (concepts) are described. If classes are described by scoring functions, then concept learning can be done using the technique of coefficient adjustment. If, however, we want to define classes structurally, some other technique for learning class definitions is necessary. In this section, we present three such techniques.

Winston's Learning Program

Winston describes an early structural concept-learning program. This program operated in a simple blocks world domain. Its goal was to construct representations of the definitions of concepts in the blocks domain.
For example, it learned the concepts: House, Tent, and Arch shown in Figure 7.2. The figure also shows an example of a near miss for each concept. A near miss is an object that is not an instance of the concept in question but that is very similar to such instances. The program started with a line drawing of a blocks world structure. It used procedures to analyze the drawing and construct a semantic net representation of the structural description of the object(s). This structural description was then provided as input to the learning program. An example of such a structural description for the House of Figure 7.2 is shown in Figure 7.3(a). Node A represents the entire structure, which is composed of two parts: node B, a Wedge, and node C, a Brick. Figures 7.3(b) and 7.3(c) show descriptions of the two Arch structures of Figure 7.2. These
descriptions are identical except for the types of the objects on the top: one is a Brick while the other is a Wedge. Notice that the two supporting objects are related not only by left-of and right-of links, but also by a does-not-marry link, which says that the two objects do not marry. Two objects marry if they have faces that touch, and they have a common edge. The marry relation is critical in the definition of an Arch. It is the difference between the first arch structure and the near miss arch structure shown in Figure 7.2.

Figure 7.2: Some Blocks World Concepts

The basic approach that Winston's program took to the problem of concept formation can be described as follows:

1. Begin with a structural description of one known instance of the concept. Call that description the concept definition.
2. Examine descriptions of other known instances of the concept. Generalize the definition to include them.
3. Examine descriptions of near misses of the concept. Restrict the definition to exclude these.

Steps 2 and 3 of this procedure can be interleaved.

Steps 2 and 3 of this procedure rely heavily on a comparison process by which similarities and differences between structures can be detected. This process must function in much the same way as does any other matching process, such as one to determine whether a given production rule can be applied to a particular problem state. Because differences as well as similarities must
be found, the procedure must perform not just literal but also approximate matching. The output of the comparison procedure is a skeleton structure describing the commonalities between the two input structures. It is annotated with a set of comparison notes that describe specific similarities and differences between the inputs.
To see how this approach works, we trace it through the process of learning what an arch is. Suppose that the arch description of Figure 7.3(b) is presented first. It then becomes the definition of the concept Arch. Then suppose that the arch description of Figure 7.3(c) is presented. The
comparison routine will return a structure similar to the two input structures except that it will note that the objects represented by the nodes labeled C are not identical. This structure is shown as Figure 7.4. The c-note link from node C describes the difference found by the comparison routine. It notes that the difference occurred in the isa link, and that in the first structure the isa link pointed to Brick, and in the second it pointed to Wedge. It also notes that if we were to follow isa links from Brick and Wedge, these links would eventually merge. At this point, a new description of the concept Arch can be generated. This description could say simply that node C must be either a Brick or a Wedge. But since this particular disjunction has no previously known significance, it is probably better to trace up the isa hierarchies of Brick and Wedge until they merge. Assuming that that happens at the node Object, the Arch definition shown in Figure 7.5 can be built.

Figure 7.3: The Structural Description

Figure 7.4
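Tracing up the isa hierarchies until they merge can be sketched as follows. The ISA table is an assumption: the text only tells us that Brick and Wedge are taken to merge at Object, so those are the only links included, and the function names are made up for this example.

```python
# Assumed isa links; the text only says that Brick and Wedge merge at Object.
ISA = {"Brick": "Object", "Wedge": "Object"}


def ancestors(node, isa=ISA):
    """Return node followed by everything reachable along isa links."""
    chain = [node]
    while chain[-1] in isa:
        chain.append(isa[chain[-1]])
    return chain


def merge_point(a, b, isa=ISA):
    """First class on a's isa chain that also lies on b's chain, or None."""
    on_b_chain = set(ancestors(b, isa))
    for cls in ancestors(a, isa):
        if cls in on_b_chain:
            return cls
    return None


def generalize(definition, example, isa=ISA):
    """Widen each isa entry of the definition just enough to cover the example."""
    widened = dict(definition)
    for node, cls in definition.items():
        other = example.get(node, cls)
        if other != cls:
            widened[node] = merge_point(cls, other, isa)
    return widened


# Node C is a Brick in the first arch and a Wedge in the second.
arch = generalize({"C": "Brick"}, {"C": "Wedge"})
```

Here arch records that node C need only be an Object, which is the generalization the text arrives at.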
Figure 7.5: The Arch Description after Two Examples

Next, suppose that the near miss arch is presented. This time, the comparison routine will note that the only difference between the current definition and the near miss is in the does-not-marry link between nodes B and D. But since this is a near miss, we do not want to broaden the definition to include it. Instead, we want to restrict the definition so that it is specifically excluded. To do this, we modify the link does-not-marry, which may simply be recording something that has happened by chance to be true of the small number of examples that have been presented. It must now say must-not-marry. Actually, must-not-marry should not be a completely new link. There must be some structure among link types to reflect the relationships between marry, does-not-marry, and must-not-marry.

Notice how the problem-solving and knowledge representation techniques we covered in earlier chapters are brought to bear on the problem of learning. Semantic networks were used to describe block structures, and an isa hierarchy was used to describe relationships among already known objects. A matching process was used to detect similarities and differences between structures, and hill climbing allowed the program to evolve a more and more accurate concept definition.

This approach to structural concept learning is not without its problems. One major problem is that a teacher must guide the learning program through a carefully chosen sequence of examples. In the next section, we explore a learning technique that is insensitive to the order in which examples are presented.

Example Near Misses:
Goal

Winston's Arch Learning Program models how we humans learn a concept such as "What is an arch?" This is done by presenting to the "student" (the program) examples and counterexamples (which are, however, very similar to examples). The latter are called "near misses." The program builds (Winston's version of) a semantic network. With every example and with every near miss, the semantic network gets better at predicting what exactly an arch is.

Learning Procedure W

1. Let the description of the first sample, which must be an example, be the initial description.
2. For all other samples:
   - If the sample is a near miss, call SPECIALIZE.
   - If the sample is an example, call GENERALIZE.

Specialize

  • Establish a matching between parts of the near miss and parts of the model.
  • IF there is a single most important difference THEN:
    – If the model has a link that is not in the near miss, convert it into a MUST link.
    – If the near miss has a link that is not in the model, create a MUST-NOT link.
    ELSE ignore the near miss.

Generalize

1. Establish a matching between parts of the sample and the model.
2. For each difference:
   – If the link in the model points to a super/sub-class of the link in the sample, a MUST-BE link is established at the join point.
   – If the links point to classes of an exhaustive set, drop the link from the model.
   – Otherwise create a new common class and a MUST-BE arc to it.
   – If a link is missing in model or sample, drop it in the other.
   – If numbers are involved, create an interval that contains both numbers.
   – Otherwise, ignore the difference.

Because the only difference is the missing support, and it is a near miss, that arc must be important.

New Model
This uses SPECIALIZE. (Actually it uses it twice; this is an exception.) However, the following principles may be applied.

1. The teacher must use near misses and examples judiciously.
2. You cannot learn if you cannot know (=> KR).
3. You cannot know if you cannot distinguish relevant from irrelevant features. (=> Teachers can help here, by using near misses.)
4. No-Guessing Principle: When there is doubt what to learn, learn nothing.
5. Learning conventions called "Felicity Conditions" help. The teacher knows that there should be only one relevant difference, and the student uses it.
6. No-Altering Principle: If an example fails the model, create an exception. (Penguins are birds.)
7. Martin's Law: You can't learn anything unless you almost know it already.

Learning Rules from Examples
Given are 8 instances:

1. DECCG
2. AGDBC
3. GCFDC
4. CDFDE
5. CEGEC
(Comment: "These are good")

1. CGCGF
2. DGFCD
3. ECGCD
(Comment: "These are bad")

The following rule perfectly describes the warnings:

(M(1,C) ∧ M(2,G) ∧ M(3,C) ∧ M(4,G) ∧ M(5,F)) ∨
(M(1,D) ∧ M(2,G) ∧ M(3,F) ∧ M(4,C) ∧ M(5,D)) ∨
(M(1,E) ∧ M(2,C) ∧ M(3,G) ∧ M(4,C) ∧ M(5,D)) -> Warning

Now we eliminate the first conjunct from all 3 disjuncts:

(M(2,G) ∧ M(3,C) ∧ M(4,G) ∧ M(5,F)) ∨
(M(2,G) ∧ M(3,F) ∧ M(4,C) ∧ M(5,D)) ∨
(M(2,C) ∧ M(3,G) ∧ M(4,C) ∧ M(5,D)) -> Warning

Now we check whether any of the good strings 1 - 5 would cause a warning. Strings 1, 4, and 5 are out at the first remaining position (actually it is the second position of the string now). Strings 2 and 3 are out at the next (third) position. Thus, I don't really need the first conjunct in the Warning rule! I keep repeating this dropping business, until I am left with

M(5,F) ∨ M(5,D) -> Warning
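The dropping procedure can be checked mechanically. The sketch below is one possible reading of it, with two assumptions worth stating: each condition is tested one disjunct at a time rather than being removed from all disjuncts at once, and identical disjuncts are merged at the end. The function names and the dictionary encoding of M(position, letter) are invented for the example.

```python
GOOD = ["DECCG", "AGDBC", "GCFDC", "CDFDE", "CEGEC"]
BAD = ["CGCGF", "DGFCD", "ECGCD"]


def matches(conjunct, string):
    """A conjunct is a dict {position: letter}; positions count from 1."""
    return all(string[pos - 1] == letter for pos, letter in conjunct.items())


def simplify(bad, good, order):
    """Drop conditions from each disjunct, in `order`, while no good string fires."""
    rule = []
    for string in bad:
        conjunct = {pos: letter for pos, letter in enumerate(string, start=1)}
        for pos in order:
            trial = {p: l for p, l in conjunct.items() if p != pos}
            # An empty conjunct matches every string, so the last
            # condition is never dropped while good strings exist.
            if not any(matches(trial, g) for g in good):
                conjunct = trial
        if conjunct not in rule:          # merge identical disjuncts
            rule.append(conjunct)
    return rule


left_to_right = simplify(BAD, GOOD, order=[1, 2, 3, 4, 5])
```

With the left-to-right order this yields the two conditions M(5,F) and M(5,D). Passing a different order to simplify changes which conditions survive.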
This rule contains all the knowledge contained in the examples! Note that this elimination approach is order dependent. Working from right to left you would get:

(M(1,C) ∧ M(2,G)) ∨ (M(1,D) ∧ M(2,G)) ∨ M(1,E) -> Warning

Therefore the resulting rule is order dependent! There are N! (factorial) orders in which we could try to delete features. Also, there is no guarantee that the rules will work correctly on new instances. In fact, rules are not even guaranteed to exist! To deal with the N! problem, try to find "higher level features". For instance, if 2 features always occur together, replace them by one feature.

Student Activity 7.2

Before reading the next section, answer the following questions.

1. What do you mean by learning by induction? Explain with an example.
2. Describe learning by macro-operators.
3. Explain why a good knowledge representation is useful in reasoning.

If your answers are correct, then proceed to the next section.

Explanation-Based Learning

The previous section illustrated how we can induce concept descriptions from positive and negative examples. Learning complex concepts using these procedures typically requires a substantial number of training instances. But people seem to be able to learn quite a bit from single examples. Consider a chess player who, as Black, has reached the position shown in Figure 7.6. The position is called a "fork" because the white knight attacks both the black king and the black queen. Black must move the king, thereby leaving the queen open to capture. From this single experience, Black is able to learn quite a bit about the fork trap: the idea is that if any piece x attacks both the opponent's king and another piece y, then piece y will be lost. We don't need to see dozens of positive and negative examples of fork positions in order to draw these conclusions. From just one experience, we can learn to avoid this trap in the future and perhaps to use it to our own advantage.
What makes such single-example learning possible? The answer, not surprisingly, is knowledge. The chess player has plenty of domain-specific knowledge that can be brought to bear, including the rules of chess and any previously acquired strategies. That knowledge can be used to identify the critical aspects of the training example. In the case of the fork, we know that the double simultaneous attack is important while the precise position and type of the attacking piece is not.

Much of the recent work in machine learning has moved away from the empirical, data-intensive approach described in the last section toward this more analytical, knowledge-intensive approach. A number of independent studies led to the characterization of this approach as explanation-based learning. An EBL system attempts to learn from a single example x by explaining why x is an example of the target concept. The explanation is then generalized, and the system's performance is improved through the availability of this knowledge.

Figure 7.6: A Fork Position in Chess

Mitchell et al. and DeJong and Mooney both describe general frameworks for EBL programs and give general learning algorithms. We can think of EBL programs as accepting the following as input:

1. A Training Example: What the learning program "sees" in the world
2. A Goal Concept: A high-level description of what the program is supposed to learn
3. An Operationality Criterion: A description of which concepts are usable
4. A Domain Theory: A set of rules that describe relationships between objects and actions in a domain
From this, EBL computes a generalization of the training example that is sufficient to describe the goal concept, and also satisfies the operationality criterion.

Let's look more closely at this specification. The training example is a familiar input; it is the same thing as the example in the version space algorithm. The goal concept is also familiar, but in previous sections, we have viewed the goal concept as an output of the program, not an input. The assumption here is that the goal concept is not operational, just like the high-level card-playing advice. An EBL program seeks to operationalize the goal concept by expressing it in terms that a problem-solving program can understand. These terms are given by the operationality criterion. In the chess example, the goal concept might be something like "bad position for Black," and the operationalized concept would be a generalized description of situations similar to the training example, given in terms of pieces and their relative positions. The last input to an EBL program is a domain theory, in our case, the rules of chess. Without such knowledge, it is impossible to come up with a correct generalization of the training example.

Explanation-based generalization (EBG) is an algorithm for EBL described in Mitchell et al. It has two steps: (1) explain and (2) generalize. During the first step, the domain theory is used to prune away all the unimportant aspects of the training example with respect to the goal concept. What is left is an explanation of why the training example is an instance of the goal concept. This explanation is expressed in terms that satisfy the operationality criterion. The next step is to generalize the explanation as far as possible while still describing the goal concept.
Following our chess example, the first EBL step chooses to ignore White's pawns, king, and rook, and constructs an explanation consisting of White's knight, Black's king, and Black's queen, each in their specific positions. Operationality is ensured: all chess-playing programs understand the basic concepts of piece and position. Next, the explanation is generalized. Using domain knowledge, we find that moving the pieces to a different part of the board is still bad for Black. We can also determine that other pieces besides knights and queens can participate in fork attacks.

In reality, current EBL methods run into difficulties in domains as complex as chess, so we will not pursue this example further. Instead, let's look at a simpler case. Consider the problem of learning the concept Cup. Unlike the arch-learning program, we want to be able to generalize from a single example of a cup. Suppose the example is:

Training Example:

owner(Object23, Ralph) ∧ has-part(Object23, Concavity12) ∧
is(Object23, Light) ∧ color(Object23, Brown) ∧ ...

Clearly, some of the features of Object23 are more relevant to its being a cup than others. So far in this unit we have seen several methods for isolating relevant features. All these methods require many positive and negative examples. In EBL we instead rely on domain knowledge, such as:

Domain Knowledge:

is(x, Light) ∧ has-part(x, y) ∧ isa(y, Handle) → liftable(x)
has-part(x, y) ∧ isa(y, Bottom) ∧ is(y, Flat) → stable(x)
has-part(x, y) ∧ isa(y, Concavity) ∧ is(y, Upward-Pointing) → open-vessel(x)

We also need a goal concept to operationalize:

Goal Concept: Cup

x is a Cup if x is liftable, stable, and open-vessel.

Operationality Criterion: Concept definition must be expressed in purely structural terms (e.g., Light, Flat, etc.).

Given a training example and a functional description, we want to build a general structural description of a cup. The first step is to explain why Object23 is a cup. We do this by constructing a proof, as shown in Figure 7.7. Standard theorem-proving techniques can be used to find such a proof. Notice that the proof isolates the relevant features of the training example; nowhere in the proof do the predicates owner and color appear. The proof also serves as a basis for a valid generalization. If we gather up all the assumptions and replace constants with variables, we get the following description of a cup:

has-part(x, y) ∧ isa(y, Concavity) ∧ is(y, Upward-Pointing) ∧
has-part(x, z) ∧ isa(z, Bottom) ∧ is(z, Flat) ∧
has-part(x, w) ∧ isa(w, Handle) ∧ is(x, Light)

This definition satisfies the operationality criterion and could be used by a robot to classify objects.
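The explain-then-generalize steps for the cup can be sketched with a very small backward chainer. Several things here are assumptions made for illustration: the object names Handle16 and Bottom19 and the facts about them are hypothetical stand-ins for the part of the training example that is elided by "...", the prover only expands rules for goals that are already fully instantiated, and generalization is done by the simple constant-for-variable replacement described above rather than by a more careful regression.

```python
FACTS = [
    ("owner", "Object23", "Ralph"),
    ("has-part", "Object23", "Concavity12"),
    ("is", "Object23", "Light"),
    ("color", "Object23", "Brown"),
    # Hypothetical facts standing in for the "..." in the training example.
    ("has-part", "Object23", "Handle16"),
    ("has-part", "Object23", "Bottom19"),
    ("isa", "Handle16", "Handle"),
    ("isa", "Bottom19", "Bottom"),
    ("is", "Bottom19", "Flat"),
    ("isa", "Concavity12", "Concavity"),
    ("is", "Concavity12", "Upward-Pointing"),
]
RULES = [  # (head, body); names starting with "?" are variables
    (("cup", "?x"), [("liftable", "?x"), ("stable", "?x"), ("open-vessel", "?x")]),
    (("liftable", "?x"), [("is", "?x", "Light"), ("has-part", "?x", "?y"), ("isa", "?y", "Handle")]),
    (("stable", "?x"), [("has-part", "?x", "?y"), ("isa", "?y", "Bottom"), ("is", "?y", "Flat")]),
    (("open-vessel", "?x"), [("has-part", "?x", "?y"), ("isa", "?y", "Concavity"),
                             ("is", "?y", "Upward-Pointing")]),
]
INSTANCES = {"Object23", "Concavity12", "Handle16", "Bottom19"}


def is_var(term):
    return term.startswith("?")


def match(pattern, ground, bindings):
    """Extend bindings so that pattern equals the ground literal, else None."""
    if len(pattern) != len(ground):
        return None
    bindings = dict(bindings)
    for p, g in zip(pattern, ground):
        p = bindings.get(p, p)
        if is_var(p):
            bindings[p] = g
        elif p != g:
            return None
    return bindings


def prove(goals, bindings):
    """Yield (bindings, facts used) for each proof of the conjunction."""
    if not goals:
        yield bindings, []
        return
    goal = tuple(bindings.get(t, t) for t in goals[0])
    for fact in FACTS:                        # operational: look the fact up
        b2 = match(goal, fact, bindings)
        if b2 is not None:
            for b3, used in prove(goals[1:], b2):
                yield b3, [fact] + used
    if not any(is_var(t) for t in goal):      # derived: expand a rule
        for head, body in RULES:
            local = match(head, goal, {})
            if local is not None:
                for _, used in prove(body, local):
                    for b3, rest in prove(goals[1:], bindings):
                        yield b3, used + rest


def generalize(leaves):
    """Replace each individual object by a variable; keep class names."""
    names = {}
    return [tuple(names.setdefault(t, f"?v{len(names)}") if t in INSTANCES else t
                  for t in literal) for literal in leaves]


_, explanation = next(prove([("cup", "Object23")], {}))
cup_description = generalize(explanation)
```

The explanation contains only the nine facts that the proof actually used, so owner and color drop out, and cup_description is the variablized conjunction of those facts.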
Figure 7.7: An Explanation

Simply replacing constants by variables worked in this example, but in some cases it is necessary to retain certain constants. To catch these cases, we must reprove the goal. This process, which we saw earlier in our discussion of learning in STRIPS, is called goal regression.

As we have seen, EBL depends strongly on a domain theory. Given such a theory, why are examples needed at all? We could have operationalized the goal concept Cup without reference to an example, since the domain theory contains all of the requisite information. The answer is that examples help to focus the learning on relevant operationalizations. Without an example cup, EBL is faced with the task of characterizing the entire range of objects that satisfy the goal concept. Most of these objects will never be encountered in the real world, and so the result will be overly general.

Providing a tractable domain theory is a difficult task. There is evidence that humans do not learn with very primitive relations. Instead, they create incomplete and inconsistent domain theories. For example, returning to chess, such a theory might include concepts like "weak pawn structure." Getting EBL to work in ill-structured domain theories is an active area of research.

EBL shares many features of all the learning methods described in earlier sections. Like concept learning, EBL begins with a positive example of some concept. As in learning by advice taking, the goal is to operationalize some piece of knowledge. And EBL techniques, like the techniques of chunking and macro-operators, are often used to improve the performance of problem-solving engines. The major difference between EBL and other learning methods is that EBL programs are built to take advantage of domain knowledge. Since learning is just another kind of problem solving, it should come as no surprise that there is leverage to be found in knowledge.
Learning Automation

The theory of learning automata was first introduced in 1961 (Tsetlin, 1961). Since that time these systems have been studied intensely, both analytically and through simulations (Lakshmivarahan, 1981). Learning automata systems are finite-state adaptive systems which interact iteratively with a general environment. Through a probabilistic trial-and-error response process they learn to choose or adapt to a behaviour that produces the best response. They are, essentially, a form of weak, inductive learners.

In Figure 7.8, we see that the learning model for learning automata has been simplified to just two components, an automaton (learner) and an environment. The learning cycle begins with an input to the learning automata system from the environment. This input elicits one of a finite number of possible responses, and the environment then provides some form of feedback to the automaton in return. The automaton uses this feedback to alter its stimulus-response mapping structure so that its behaviour improves.

As a simple example, suppose a learning automaton is being used to learn the best temperature control setting for your office each morning. It may select any one of ten temperature range settings at the beginning of each day (Figure 7.9). Without any prior knowledge of your temperature preferences, the automaton randomly selects a first setting using the probability vector corresponding to the temperature settings.

Figure 7.8: Learning Automaton Model
Figure 7.9: Temperature Control Model

Since the probability values are uniformly distributed, any one of the settings will be selected with equal likelihood. After the selected temperature has stabilized, the environment may respond with a simple good-bad feedback response. If the response is good, the automaton will modify its probability vector by rewarding the probability corresponding to the good setting with a positive increment and reducing all other probabilities proportionately to maintain the sum equal to 1. If the response is bad, the automaton will penalize the selected setting by reducing the probability corresponding to the bad setting and increasing all other values proportionately. This process is repeated each day until the good selections have high probability values and all bad choices have values near zero. Thereafter, the system will always choose the good settings. If, at some point in the future, your temperature preferences change, the automaton can easily readapt.

Learning automata have been generalized and studied in various ways. One such generalization has been given the special name of collective learning automata (CLA). CLAs are standard learning automata systems except that feedback is not provided to the automaton after each response. In this case, several collective stimulus-response actions occur before feedback is passed to the automaton. It has been argued (Bock, 1976) that this type of learning more closely resembles that of human beings in that we usually perform a number or group of primitive actions before receiving feedback on the performance of such actions, such as solving a complete problem on a test or parking a car.

We illustrate the operation of CLAs with an example of learning to play the game of Nim in an optimal way. Nim is a two-person zero-sum game in which the players alternate in removing tokens from an array that initially has nine tokens.
The tokens are arranged into three rows with one token in the first row, three in the second row, and five in the third row (Figure 7.10).
  • 32. Figure 7.10: Nim Initial Configuration

The first player must remove at least one token, and may remove up to all of the tokens in any single row. Tokens can only be removed from a single row during each player's move. The second player responds by removing one or more tokens remaining in any row. Players alternate in this way until all tokens have been removed; the loser is the player forced to remove the last token.

We will use the triple (n1, n2, n3) to represent the state of the game at a given time, where n1, n2, and n3 are the numbers of tokens in rows 1, 2, and 3, respectively. We will also use a matrix to determine the moves made by the CLA for any given state. The matrix of Figure 7.11 has column headings that correspond to the state of the game when it is the CLA's turn to move, and row headings that correspond to the new game state after the CLA has completed a move. Fractional entries in the matrix are transition probabilities used by the CLA to execute each of its moves. Asterisks in the matrix represent invalid moves.

Beginning with the initial state (1, 3, 5), suppose the CLA's opponent removes two tokens from the third row, resulting in the new state (1, 3, 3). If the CLA then removes all three tokens from the second row, the resultant state is (1, 0, 3). Suppose the opponent now removes all remaining tokens from the third row. This leaves the CLA with the losing configuration (1, 0, 0), since it must now remove the last token.

Figure 7.11: CLA Internal Representation of Game States

At the start of the learning sequence, the matrix is initialized such that the elements in each column are equal (uniform) probability values. For example, since there are eight valid moves
  • 33. from the state (1, 3, 4), each column element under this state that corresponds to a valid move is given the same probability, 1/8. The CLA selects moves probabilistically using the probability values in each column. So, for example, if the CLA had the first move, any row intersecting the first column that does not contain an asterisk would be chosen with probability 1/9. This choice then determines the new game state from which the opponent must select a move. The opponent might have a similar matrix to record game states and choose moves.

A complete game is played before the CLA is given any feedback, at which time it is informed whether its responses were good or bad. This is the collective feature of the CLA. If the CLA wins a game, all moves made by the CLA during that game are rewarded by increasing the probability value in each column corresponding to the move that was made. All other probabilities in those columns are reduced equally to keep the sum in each column equal to 1. If the CLA loses a game, the moves leading to that loss are penalized by reducing the probability values corresponding to each losing move. All other probabilities in the columns having a losing move are increased equally to keep the column totals equal to 1.

After a number of games have been played by the CLA, the matrix elements that correspond to repeated wins will increase toward one, while all other elements in the column will decrease toward zero. Consequently, the CLA will choose the winning moves more frequently and thereby improve its performance.

Simulated games between a CLA and various types of opponents have been performed and the results plotted (Bock, 1985). It was shown, for example, that two CLAs playing against each other required about 300 games before each learned to play optimally.
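The reward and penalty updates described for the temperature-control automaton, and reused per column by the CLA, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the cited studies: the step size `step`, the function names, and the simulated "preferred setting" environment are all hypothetical, and the other probabilities are rescaled proportionately (as in the temperature example) rather than changed by equal amounts.

```python
import random

def update(probs, chosen, good, step=0.1):
    """Reward (good=True) or penalize the chosen entry, then rescale the
    other entries proportionately so the vector still sums to 1."""
    if len(probs) == 1:
        return [1.0]
    p = probs[chosen]
    new_p = p + step * (1.0 - p) if good else p * (1.0 - step)
    rest = 1.0 - p
    if rest <= 0.0:
        # All mass was on the chosen entry: share the remainder equally.
        share = (1.0 - new_p) / (len(probs) - 1)
        return [new_p if i == chosen else share for i in range(len(probs))]
    scale = (1.0 - new_p) / rest
    return [new_p if i == chosen else q * scale for i, q in enumerate(probs)]

def simulate(preferred=6, settings=10, days=500, seed=0):
    """Hypothetical environment: feedback is 'good' only for one setting."""
    rng = random.Random(seed)
    probs = [1.0 / settings] * settings      # uniform initial vector
    for _ in range(days):
        chosen = rng.choices(range(settings), weights=probs)[0]
        probs = update(probs, chosen, good=(chosen == preferred))
    return probs

def update_after_game(columns, moves, won, step=0.1):
    """CLA-style collective feedback: apply the same update to every
    (state, move_index) pair played in a game, once the result is known."""
    for state, move in moves:
        columns[state] = update(columns[state], move, won, step)
    return columns
```

In this sketch the probability of the preferred setting never decreases, because it grows both when it is rewarded and when any other setting is penalized; that is why the vector drifts toward the good choice.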
Convergence to optimality can, however, be accomplished in fewer games than the 300 or so needed by two CLAs learning against each other if the opponent always plays optimally (or poorly), since, in such a case, the CLA will repeatedly lose (win) and quickly reduce (increase) the losing (winning) move elements to zero (one). It is also possible to speed up the learning process through the use of other techniques such as learned heuristics.

Learning systems based on the learning automaton or CLA paradigm are fairly general for applications in which a suitable state representation scheme can be found. They are also quite robust learners. In fact, it has been shown that an LA will converge to an optimal distribution under fairly general conditions if the feedback is accurate with probability greater than 0.5 (Narendra
  • 34. and Thathachar, 1974). Of course, the rate of convergence is strongly dependent on the reliability of the feedback.

Learning automata are not very efficient learners, as was noted in the game-playing example above. They are, however, relatively easy to implement, provided the number of states is not too large. When the number of states becomes large, the amount of storage and the computation required to update the transition matrix become excessive.

Potential applications for learning automata include adaptive telephone routing and control. Such applications have been studied using simulation programs (Narendra et al., 1977).

Genetic Algorithms

Genetic algorithm learning methods are based on models of natural adaptation and evolution. These learning systems improve their performance through processes that model population genetics and survival of the fittest. They have been studied since the early 1960s (Holland, 1962, 1975).

In the field of genetics, a population is subjected to an environment that places demands on the members. The members that adapt well are selected for mating and reproduction. The offspring of these better performers inherit genetic traits from both their parents. Members of this second generation of offspring that also adapt well are then selected for mating and reproduction, and the evolutionary cycle continues. Poor performers die off without leaving offspring. Good performers produce good offspring and they, in turn, perform well. After some number of generations, the resultant population will have adapted optimally, or at least very well, to the environment.

Genetic algorithm systems start with a fixed-size population of data structures that are used to perform some given tasks. After requiring the structures to execute the specified tasks some number of times, the structures are rated on their performance, and a new generation of data structures is then created.
The new generation is created by mating the higher performing structures to produce offspring. These offspring and their parents are then retained for the next generation, while the poorer performing structures are discarded. The basic cycle is illustrated in Figure 7.12. Mutations are also performed on the best performing structures to ensure that the full space of possible structures is reachable. This process is repeated for a number of generations until the resultant population consists of only the highest performing structures.
  • 35. Data structures that make up the population can represent rules or any other suitable types of knowledge structure. To illustrate the genetic aspects of the problem, assume for simplicity that the populations of structures are fixed-length binary strings such as the eight-bit string 11010001. An initial population of these eight-bit strings would be generated randomly or with the use of heuristics at time zero. These strings, which might be simple condition-action rules, would then be assigned some tasks to perform (like predicting the weather based on certain physical and geographic conditions, or diagnosing a fault in a piece of equipment).

Figure 7.12: Genetic Algorithm

After multiple attempts at executing the tasks, each of the participating structures would be rated and tagged with a utility value, u, commensurate with its performance. The next population would then be generated using the higher performing structures as parents, and the process would be repeated with the newly produced generation. After many generations the remaining population structures should perform the desired tasks well.

Mating between two strings is accomplished with the crossover operation, which randomly selects a bit position in the eight-bit string and concatenates the head of one parent to the tail of the second parent to produce the offspring. Suppose the two parents are designated as xxxxxxxx and yyyyyyyy respectively, and suppose the third bit position has been selected as the crossover point (at the position of the colon in the structure xxx:xxxxx). After the crossover operation is applied,
  • 36. two offspring are then generated, namely xxxyyyyy and yyyxxxxx. Such offspring and their parents are then used to make up the next generation of structures.

A second genetic operation often used is called inversion. Inversion is a transformation applied to a single string. A bit position is selected at random, and the inversion operation concatenates the tail of the string to the head of the same string. Thus, if the sixth position were selected in the string x1x2x3x4x5x6x7x8, the inverted string would be x7x8x1x2x3x4x5x6.

A third operator, mutation, is used to ensure that all locations of the rule space are reachable, that is, that every potential rule in the rule space is available for evaluation. This helps keep the selection process from getting caught in a local minimum. For example, it may happen that use of the crossover and inversion operators will only produce a set of structures that are better than all local neighbors but not optimal in a global sense. This can happen since crossover and inversion may not be able to produce some undiscovered structures. The mutation operator can overcome this by simply selecting any bit position in a string at random and changing it. This operator is typically used only infrequently, to prevent random wandering in the search space.

The genetic paradigm is best understood through an example. To illustrate similarities between the learning automaton paradigm and the genetic paradigm, we use the same learning task as in the previous section, namely learning to play the game of Nim optimally. We use a slightly different representation scheme here, since we want a population of structures that are easily transformed.
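The three genetic operators described above (crossover, inversion, and mutation) can be sketched on fixed-length strings as follows. This is only an illustration: the function names are hypothetical, positions are counted as "number of characters in the head", and the random wrapper at the end is one assumed way of picking the position, not something specified in the text.

```python
import random

def crossover(a, b, point):
    """Single-point crossover: swap the tails that follow `point` characters."""
    return a[:point] + b[point:], b[:point] + a[point:]

def inversion(s, point):
    """Move the tail that follows `point` characters to the front."""
    return s[point:] + s[:point]

def mutate(s, position):
    """Flip the bit at `position` (0-based) in a binary string."""
    flipped = "1" if s[position] == "0" else "0"
    return s[:position] + flipped + s[position + 1:]

def random_crossover(a, b, rng):
    """Pick a crossover point strictly inside the string, then cross over."""
    point = rng.randrange(1, len(a))
    return crossover(a, b, point)

# The examples from the text:
print(crossover("xxxxxxxx", "yyyyyyyy", 3))   # ('xxxyyyyy', 'yyyxxxxx')
print(inversion("12345678", 6))               # '78123456'
print(mutate("11010001", 0))                  # '01010001'
```

Digits stand in for x1 through x8 in the inversion example, so the result 78123456 corresponds to x7x8x1x2x3x4x5x6.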
To obtain such a population, we let each member consist of a pair of triples augmented with a utility value u, ((n1, n2, n3) (m1, m2, m3) u), where the first triple is the game state presented to the genetic algorithm system prior to its move, and the second triple is the state after the move. The u values represent the worth or current utility of the structure at any given time.

Before the game begins, the genetic system randomly generates an initial population of K triple-pair members. The population size K is one of the important parameters that must be selected. Here, we simply assume it is about 25 or 30, which should be more than the number of moves needed for any optimal play. All members are assigned an initial utility value of 0. The learning process then proceeds as follows:

1. The environment presents a valid triple to the genetic system.

2. The genetic system searches for all population triple pairs that have a first triple that matches the input triple. From those that match, the first one found having the highest utility value u is selected and the second triple is returned as the new game state. If no match is found, the
  • 37. genetic system randomly generates a triple that represents a valid move, returns this as the new state, and stores the new triple pair (the input triple and the newly generated triple) as an addition to the population.

3. The above two steps are repeated until the game is terminated, at which point the genetic system is informed whether a win or loss occurred. If the system wins, each of the participating member moves has its utility value increased. If the system loses, each participating member has its utility value decreased.

4. The above steps are repeated until a fixed number of games have been played. At this time a new generation is created.

The new generation is created from the old population by first selecting a fraction (say one half) of the members having the highest utility values. From these, offspring are obtained by application of appropriate genetic operators. The three operators, crossover, inversion, and mutation, randomly modify the parent moves to give new offspring move sequences. (The best choice of genetic operators to apply in this example is left as an exercise.) Each offspring inherits a utility value from one of the parents. Population members having low utility values are discarded to keep the population size fixed. This whole process is repeated until the genetic system has learned all the optimal moves. This can be determined when the parent population ceases to change or when the genetic system repeatedly wins.

The similarity between the learning automaton and genetic paradigms should be apparent from this example. Both rely on random move sequences, and the better moves are rewarded while the poorer ones are penalized.

Neural Network

Neural networks are large networks of simple processing elements or nodes that process information dynamically in response to external inputs. The nodes are simplified models of neurons.
The knowledge in a neural network is distributed throughout the network in the form of internode connections and weighted links that form the inputs to the nodes. The link weights serve to enhance or inhibit the input stimuli values that are then added together at the nodes. If
  • 38. the sum of all the inputs to a node exceeds some threshold value T, the node executes and produces an output that is passed on to other nodes or is used to produce some output response. In the simplest case, no output is produced if the total input is less than T. In more complex models, the output will depend on a nonlinear activation function.

Neural networks were originally inspired by models of the human nervous system. They are greatly simplified models, to be sure (neurons are known to be fairly complex processors). Even so, they have been shown to exhibit many "intelligent" abilities, such as learning, generalization, and abstraction.

A single node is illustrated in Figure 7.13. The inputs to the node are the values x1, x2, ..., xn, which typically take on values of -1, 0, 1, or real values within the range (-1, 1). The weights w1, w2, ..., wn correspond to the synaptic strengths of a neuron. They serve to increase or decrease the effects of the corresponding xi input values. The sum of the products xi × wi, i = 1, 2, ..., n, serves as the total combined input to the node. If this sum is large enough to exceed the threshold amount T, the node fires and produces an output y, an activation function value placed on the node's output links. The output may then be the input to other nodes or the final output response from the network.

Figure 7.13: Model of a single neuron (node)

Figure 7.14 illustrates three layers of a number of interconnected nodes. The first layer serves as the input layer, receiving inputs from some set of stimuli. The second layer (called the hidden layer) receives inputs from the first layer and produces a pattern of inputs to the third layer, the output layer. The pattern of outputs from the final layer is the network's response to the input stimuli patterns. Input links to layer j (j = 1, 2, 3) have weights wij for i = 1, 2, ..., n.
General multilayer networks having n nodes (number of rows) in each of m layers (number of columns of nodes) will have weights represented as an n x m matrix W. Using this representation, nodes having no interconnecting links will have a weight value of zero. Networks consisting of
  • 39. more than three layers would, of course, be correspondingly more complex than the network depicted in Figure 7.14.

A neural network can be thought of as a black box that transforms the input vector x to the output vector y, where the transformation performed is the result of the pattern of connections and weights, that is, according to the values of the weight matrix W.

Consider the inner (dot) product of an input vector x and a weight vector w:

$$\mathbf{x} \cdot \mathbf{w} = \sum_{i=1}^{n} x_i w_i$$

There is a geometric interpretation for this product. It is equivalent to projecting one vector onto the other vector in n-dimensional space. This notion is depicted in Figure 7.15 for the two-dimensional case. The value of the product is given by

$$\mathbf{x} \cdot \mathbf{w} = |\mathbf{x}|\,|\mathbf{w}| \cos\theta$$

where |x| denotes the norm or length of the vector x and θ is the angle between the two vectors. Note that this product is a maximum when both vectors point in the same direction, that is, when θ = 0. The product is a minimum when both point in opposite directions, that is, when θ = 180 degrees. This illustrates how the vectors in the weight matrix W influence the inputs to the nodes in a neural network.

Figure 7.14: A Multilayer Neural Network
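The two expressions for the product of x and w (the sum of component products, and the product of the lengths times cos θ) can be checked numerically. The sketch below is only an illustration; the helper names and the sample vectors are assumptions chosen to make the arithmetic easy to follow.

```python
import math

def dot(x, w):
    """Sum of the component-wise products x_i * w_i."""
    return sum(a * b for a, b in zip(x, w))

def norm(v):
    """Length |v| of a vector."""
    return math.sqrt(dot(v, v))

def angle_between(x, w):
    """Angle theta (radians) between x and w, from the cosine formula."""
    cos_theta = dot(x, w) / (norm(x) * norm(w))
    return math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp rounding error

x = [3.0, 4.0]
same = [6.0, 8.0]        # same direction as x: theta = 0, product is largest
opposite = [-6.0, -8.0]  # opposite direction: theta = 180 degrees
perp = [-4.0, 3.0]       # perpendicular: theta = 90 degrees

for w in (same, opposite, perp):
    theta = angle_between(x, w)
    print(dot(x, w), norm(x) * norm(w) * math.cos(theta))
```

For vectors of fixed length, the weighted input to a node is therefore largest when the input pattern is aligned with the node's weight vector.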
  • 40. Figure 7.15: Vector Multiplication is Like Vector Projection

Learning pattern weights

The interconnections and weights W in the neural network store the knowledge possessed by the network. These weights must be preset or learned in some manner. When learning is used, the process may be either supervised or unsupervised. In the supervised case, learning is performed by repeatedly presenting the network with an input pattern and a desired output response. The training examples then consist of the vector pairs (x, y'), where x is the input pattern and y' is the desired output response pattern. The weights are then adjusted until the actual output response y and the desired response y' are essentially the same, that is, until the difference D = y' - y is near zero.

One of the simpler supervised learning algorithms uses the following formula to adjust the weights W:

$$W_{new} = W_{old} + a \cdot D \cdot \frac{x}{|x|^2}$$

where 0 < a < 1 is a learning constant that determines the rate of learning. With D taken as the desired response minus the actual response, each adjustment moves the output toward the target. When the difference D is large, the adjustment to the weights W is large, but when the output response y is close to the target response y' the adjustment will be small. When the difference D is near zero, the training process terminates, at which point the network will produce the correct response for the given input patterns x.

In unsupervised learning, the training examples consist of the input vectors x only. No desired response y' is available to guide the system. Instead, the learning process must find the weights W with no knowledge of the desired output response.
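The supervised weight-adjustment rule above can be sketched for a single linear node whose output is the weighted sum of its inputs. This is a minimal sketch under stated assumptions: the difference is taken as D = y' - y (desired minus actual), which is the sign for which adding a·D·x/|x|² moves the output toward the target; the function names, the tolerance, and the sample pattern are hypothetical.

```python
def node_response(w, x):
    """Actual response y of a linear node: the weighted sum of its inputs."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_step(w, x, target, a=0.5):
    """One application of W_new = W_old + a * D * x / |x|^2, with D = y' - y.
    Returns the new weights and the difference D measured before the update."""
    d = target - node_response(w, x)
    norm_sq = sum(xi * xi for xi in x)
    return [wi + a * d * xi / norm_sq for wi, xi in zip(w, x)], d

def train(w, x, target, a=0.5, tol=1e-3, max_steps=1000):
    """Repeat the update until the difference D is near zero."""
    for _ in range(max_steps):
        w_next, d = train_step(w, x, target, a)
        if abs(d) < tol:
            return w
        w = w_next
    return w

w = train([0.0, 0.0, 0.0], [1.0, -1.0, 0.5], target=1.0)
print(node_response(w, [1.0, -1.0, 0.5]))   # close to the target 1.0
```

Because the correction is divided by |x|², each step changes the node's response to x by exactly a·D, so the remaining difference shrinks by the factor (1 - a) per step. That is why large differences produce large adjustments and small differences produce small ones.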
  • 41. A neural net expert system

An example of a simple neural network diagnostic expert system has been described by Stephen Gallant (1988). This system diagnoses and recommends treatments for acute sarcophagal disease. The system is illustrated in Figure 7.16. From six symptom variables u1, u2, ..., u6, one of two possible diseases can be diagnosed, u7 or u8. From the resultant diagnosis, one of three treatments, u9, u10, or u11, can then be recommended.

Figure 7.16: A Simple Neural Network Expert System

When a given symptom is present, the corresponding variable is given a value of +1 (true). Negative symptoms are given an input value of -1 (false), and unknown symptoms are given the value 0. Input symptom values are multiplied by their corresponding link weights. Numbers within the nodes are initial bias weights wi0, and numbers on the links are the other node input weights. When the sum of the weighted products of the inputs exceeds 0, an output will be present on the corresponding node output and serve as an input to the next layer of nodes.
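The node computation just described can be sketched as below. The bias 0 and link weights 2, -2, and 3 are the ones quoted in the worked superciliosis calculation in the text; everything else (the function names, returning 0 when the sum is exactly zero, and the check for whether unknown inputs could still change the conclusion) is an assumption made for illustration and is not taken from Gallant's system.

```python
def node_output(bias, weights, inputs):
    """+1 if bias + weighted sum is positive, -1 if negative, 0 if exactly zero.
    Inputs are +1 (true), -1 (false), or 0 (unknown)."""
    total = bias + sum(w * u for w, u in zip(weights, inputs))
    if total > 0:
        return 1
    return -1 if total < 0 else 0

def conclusion_is_forced(bias, weights, inputs):
    """True when the unknown inputs (value 0) cannot change the sign of the sum,
    whatever values they later turn out to have."""
    known = bias + sum(w * u for w, u in zip(weights, inputs))
    slack = sum(abs(w) for w, u in zip(weights, inputs) if u == 0)
    return abs(known) > slack

weights = [2, -2, 3]   # links from swollen feet, red ears, hair loss
print(node_output(0, weights, [1, -1, -1]))           # 0 + 2 + 2 - 3 = 1, so +1
print(conclusion_is_forced(0, weights, [1, 0, 1]))    # red ears unknown: True
print(conclusion_is_forced(0, weights, [1, 0, -1]))   # red ears unknown: False
```

The second function captures why a diagnosis can sometimes be made from partial information: when the known inputs already outweigh everything the unknown ones could contribute, the sign of the sum is settled.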
  • 42. As an example, suppose the patient has swollen feet (u1 = +1) but not red ears (u2 = -1) nor hair loss (u3 = -1). This gives a value of u7 = +1 (since 0 + (2)(1) + (-2)(-1) + (3)(-1) = 1), suggesting the patient has superciliosis.

When it is also known that the other symptoms of the patient are false (u5 = u6 = -1), it may be concluded that namatosis is absent (u8 = -1), and therefore that birambio (u10 = +1) should be prescribed while placibin should not be prescribed (u9 = -1). In addition, it will be found that posiboost should also be prescribed (u11 = +1).

The training algorithm added the intermediate triangular shaped nodes. These additional nodes are needed so that weight assignments can be made which permit the computations to work correctly for all training instances.

Deductions can be made just as well when only partial information is available. For example, when a patient has swollen feet and suffers from hair loss, it may be concluded the patient has superciliosis, regardless of whether or not the patient has red ears. This is so because the unknown variable cannot force the sum to change to negative.

A system such as this can also explain how or why a conclusion was reached. For example, when inputs and outputs are regarded as rules, an output can be explained as the conclusion to a rule. If placibin is true, the system might explain why with a statement such as:

Placibin is TRUE due to the following rule: IF Placibin Allergy (u6) is FALSE, and Superciliosis is TRUE THEN Conclude Placibin is TRUE.

Expert systems based on neural network architectures can be designed to possess many of the features of other expert system types, including explanations for how and why, and confidence estimation for variable deduction.

Learning in Neural Networks

The perceptron, an invention of Rosenblatt [1962], was one of the earliest neural network models.
A perceptron models a neuron by taking a weighted sum of its inputs and sending the output 1 if the sum is greater than some adjustable threshold value (otherwise it sends 0). Figure 7.17
  • 43. shows the device. Notice that in a perceptron, unlike a Hopfield network, connections are unidirectional.

The inputs (x1, x2, ..., xn) and connection weights (w1, w2, ..., wn) in the figure are typically real values, both positive and negative. If the presence of some feature xi tends to cause the perceptron to fire, the weight wi will be positive; if the feature xi inhibits the perceptron, the weight wi will be negative. The perceptron itself consists of the weights, the summation processor, and the adjustable threshold processor. Learning is a process of modifying the values of the weights and the threshold. It is convenient to implement the threshold as just another weight w0, as in Figure 7.18. This weight can be thought of as the propensity of the perceptron to fire irrespective of its inputs. The perceptron of Figure 7.18 fires if the weighted sum is greater than zero.

A perceptron computes a binary function of its input. Several perceptrons can be combined to compute more complex functions, as shown in Figure 7.19.

Figure 7.17: A Neuron and a Perceptron

Figure 7.18: Perceptron with Adjustable Threshold Implemented as Additional Weight
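The idea of folding the adjustable threshold into an extra weight w0 can be made concrete with a short sketch. It assumes the usual construction in which an extra input fixed at 1 is multiplied by w0 = -T; the function names and the sample weights (which happen to compute logical AND) are hypothetical.

```python
def fires_with_threshold(weights, inputs, threshold):
    """Output 1 if the weighted sum exceeds the threshold T, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

def fires_with_bias(weights, inputs):
    """Same unit with the threshold stored as weights[0] (w0), paired with a
    constant input of 1; the unit fires if the sum is greater than zero."""
    total = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if total > 0 else 0

w = [1.0, 1.0]          # hypothetical weights
T = 1.5                 # hypothetical threshold
w_with_bias = [-T] + w  # w0 = -T

for x1 in (0, 1):
    for x2 in (0, 1):
        a = fires_with_threshold(w, [x1, x2], T)
        b = fires_with_bias(w_with_bias, [x1, x2])
        print((x1, x2), a, b)   # the two columns always agree
```

Treating the threshold as a weight means a learning rule only has to adjust weights; no separate rule for the threshold is needed.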
  • 44. Figure 7.19: A Perceptron with Many Inputs and Many Outputs

Such a group of perceptrons can be trained on sample input-output pairs until it learns to compute the correct function. The amazing property of perceptron learning is this: whatever a perceptron can compute, it can learn to compute. We demonstrate this in a moment. At the time perceptrons were invented, many people speculated that intelligent systems could be constructed out of perceptrons (see Figure 7.20).

Since the perceptrons of Figure 7.19 are independent of one another, they can be separately trained. So let us concentrate on what a single perceptron can learn to do. Consider the pattern classification problem shown in Figure 7.21. This problem is linearly separable, because we can draw a line that separates one class from another. Given values for x1 and x2, we want to train a perceptron to output 1 if it thinks the input belongs to the class of white dots and 0 if it thinks the input belongs to the class of black dots. We have no explicit rule to guide us; we must induce a rule from a set of training instances. We now see how perceptrons can learn to solve such problems.

First, it is necessary to take a close look at what the perceptron computes. Let x be an input vector (x1, x2, ..., xn). The weighted summation function g(x) and the output function o(x) can be defined as follows, with the extra input x0 fixed at 1:

$$g(x) = \sum_{i=0}^{n} w_i x_i$$

$$o(x) = \begin{cases} 1 & \text{if } g(x) > 0 \\ 0 & \text{if } g(x) < 0 \end{cases}$$

Consider the case where we have only two inputs (as in Figure 7.22). Then:
  • 45. $$g(x) = w_0 + w_1 x_1 + w_2 x_2$$

Figure 7.20: An Early Notion of an Intelligent System Built from Trainable Perceptrons

If g(x) is exactly zero, the perceptron cannot decide whether to fire. A slight change in inputs could cause the device to go either way. If we solve the equation g(x) = 0, we get the equation for a line:

$$x_2 = -\frac{w_1}{w_2} x_1 - \frac{w_0}{w_2}$$

The location of the line is completely determined by the weights w0, w1, and w2. If an input vector lies on one side of the line, the perceptron will output 1; if it lies on the other side, the perceptron will output 0. A line that correctly separates the training instances corresponds to a perfectly functioning perceptron. Such a line is called a decision surface. In perceptrons with many inputs, the decision surface will be a hyperplane through the multidimensional space of possible input vectors. The problem of learning is one of locating an appropriate decision surface.

We present a formal learning algorithm later. For now, consider the informal rule:

If the perceptron fires when it should not fire, make each wi smaller by an amount proportional to xi. If the perceptron fails to fire when it should fire, make each wi larger by a similar amount.

Suppose we want to train a three-input perceptron to fire only when its first input is on. If the perceptron fails to fire in the presence of an active x1, we will increase w1 (and we may increase other weights). If the perceptron fires incorrectly, we will end up decreasing weights that are not
  • 46. w1. (We will never decrease w1, because undesired firings only occur when x1 is 0, which forces the proportional change in w1 also to be 0.) In addition, w0 will find a value based on the total number of incorrect firings versus incorrect misfirings. Soon, w1 will become large enough to overpower w0, while w2 and w3 will not be powerful enough to fire the perceptron, even in the presence of both x2 and x3.

Figure 7.21: A Linearly Separable Pattern Classification Problem

Now let us return to the functions g(x) and o(x). While the sign of g(x) is critical to determining whether the perceptron will fire, the magnitude is also important. The absolute value of g(x) tells how far a given input vector x lies from the decision surface. This gives us a way of characterizing how good a set of weights is. Let $\vec{w}$ be the weight vector (w0, w1, ..., wn), and let X be the subset of training instances misclassified by the current set of weights. Then define the perceptron criterion function, $J(\vec{w})$, to be the sum of the distances of the misclassified input vectors from the decision surface:

$$J(\vec{w}) = \sum_{\vec{x} \in X} \left| \sum_{i=0}^{n} w_i x_i \right| = \sum_{\vec{x} \in X} \left| \vec{w} \cdot \vec{x} \right|$$

To create a better set of weights than the current set, we would like to reduce $J(\vec{w})$. Ultimately, if all inputs are classified correctly, $J(\vec{w}) = 0$.

How do we go about minimizing $J(\vec{w})$? We can use a form of local-search hill climbing known as gradient descent. For our current purposes, think of $J(\vec{w})$ as defining a surface in the space of all possible weights. Such a surface might look like the one in Figure 7.22. In the figure, weight w0 should be part of the weight space but is omitted here because it is easier to visualize J in only three dimensions. Now, some of the weight vectors constitute solutions, in that a perceptron with such a weight vector will classify all its inputs correctly. Note that there are
  • 47. an infinite number of solution vectors. For any solution vector $\vec{w}_s$, we know that $J(\vec{w}_s) = 0$. Suppose we begin with a random weight vector $\vec{w}$ that is not a solution vector. We want to slide down the J surface. There is a mathematical method for doing this: we compute the gradient of the function $J(\vec{w})$. Before we derive the gradient function, we reformulate the perceptron criterion function to remove the absolute value sign. An input misclassified as a negative example has $\vec{w} \cdot \vec{x} \le 0$, so its absolute value is $\vec{w} \cdot (-\vec{x})$; an input misclassified as a positive example has $\vec{w} \cdot \vec{x} > 0$. Hence:

$$J(\vec{w}) = \sum_{\vec{x} \in X} \vec{w} \cdot \begin{cases} -\vec{x} & \text{if } \vec{x} \text{ is misclassified as a negative example} \\ \vec{x} & \text{if } \vec{x} \text{ is misclassified as a positive example} \end{cases}$$

Figure 7.22: Adjusting the Weights by Gradient Descent, Minimizing $J(\vec{w})$

Recall that X is the set of misclassified input vectors. Now, here is $\nabla J$, the gradient of $J(\vec{w})$ with respect to the weight space:

$$\nabla J(\vec{w}) = \sum_{\vec{x} \in X} \begin{cases} -\vec{x} & \text{if } \vec{x} \text{ is misclassified as a negative example} \\ \vec{x} & \text{if } \vec{x} \text{ is misclassified as a positive example} \end{cases}$$

The gradient is a vector that points in the direction in weight space along which $J(\vec{w})$ increases most rapidly, so moving against it tells us how to reduce $J(\vec{w})$.

  • 48. In order to find a solution weight vector, we simply change the weights in the direction opposite to the gradient, recompute $J(\vec{w})$, recompute the new gradient, and iterate until $J(\vec{w}) = 0$. The rule for updating the weights at time t + 1 is:

$$\vec{w}_{t+1} = \vec{w}_t - \eta \nabla J$$

Or in expanded form:

$$\vec{w}_{t+1} = \vec{w}_t + \eta \sum_{\vec{x} \in X} \begin{cases} \vec{x} & \text{if } \vec{x} \text{ is misclassified as a negative example} \\ -\vec{x} & \text{if } \vec{x} \text{ is misclassified as a positive example} \end{cases}$$

η is a scale factor that tells us how far to move along this direction. A small η will lead to slower learning, but a large η may cause a move through weight space that overshoots the solution vector. Taking η to be a constant gives us what is usually called the "fixed-increment perceptron-learning algorithm":

1. Create a perceptron with n+1 inputs and n+1 weights, where the extra input x0 is always set to 1.

2. Initialize the weights (w0, w1, ..., wn) to random real values.

3. Iterate through the training set, collecting all examples misclassified by the current set of weights.

4. If all examples are classified correctly, output the weights and quit.

5. Otherwise, compute the vector sum S of the misclassified input vectors, where each vector has the form (x0, x1, ..., xn). In creating the sum, add to S a vector $\vec{x}$ if $\vec{x}$ is an input for which the perceptron incorrectly fails to fire, but add the vector $-\vec{x}$ if $\vec{x}$ is an input for which the perceptron incorrectly fires. Multiply the sum by a scale factor η.

6. Modify the weights (w0, w1, ..., wn) by adding the elements of the vector S to them. Go to step 3.

The perceptron-learning algorithm is a search algorithm. It begins in a random initial state and finds a solution state. The search space is simply all possible assignments of real values to the weights of the perceptron, and the search strategy is gradient descent.

So far, we have seen two search methods employed by neural networks, gradient descent in perceptrons and parallel relaxation in Hopfield networks. It is important to understand the relation
  • 49. between the two. Parallel relaxation is a problem-solving strategy, analogous to state space search in symbolic AI. Gradient descent is a learning strategy, analogous to techniques such as version spaces. In both symbolic and connectionist AI, learning is viewed as a type of problem solving, and this is why search is useful in learning. But the ultimate goal of learning is to get a system into a position where it can solve problems better. Do not confuse learning algorithms with others.

The perceptron convergence theorem, due to Rosenblatt [1962], guarantees that the perceptron will find a solution state, i.e., it will learn to classify any linearly separable set of inputs. In other words, the theorem shows that in the weight space, there are no local minima that do not correspond to the global minimum. Figure 7.23 shows a perceptron learning to classify the instances of Figure 7.21. Remember that every set of weights specifies some decision surface, in this case a line in the two-dimensional input space. In the figure, k is the number of passes through the training data, i.e., the number of iterations of steps 3 through 6 of the fixed-increment perceptron-learning algorithm.

The introduction of perceptrons in the late 1950s created a great deal of excitement: here was a device that strongly resembled a neuron and for which well-defined learning algorithms were available. There was much speculation about how intelligent systems could be constructed from perceptron building blocks. In their book Perceptrons, Minsky and Papert put an end to such speculation by analyzing the computational capabilities of the devices. They noticed that while the convergence theorem guaranteed correct classification of linearly separable data, most problems do not supply such nice data. Indeed, the perceptron is incapable of learning to solve some very simple problems.

Figure 7.23: A Perceptron Learning to Solve a Classification Problem
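Steps 1 through 6 of the fixed-increment perceptron-learning algorithm can be sketched as follows. This is an illustrative sketch rather than a reference implementation: the function names, the random initial range, the pass limit `max_passes`, and the logical-OR training set are assumptions, and an input with g(x) exactly zero is treated as "does not fire".

```python
import random

def output(w, x):
    """o(x): 1 if g(x) = sum of w_i * x_i is greater than zero, else 0.
    Here x already includes the constant extra input x0 = 1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def train_perceptron(examples, eta=1.0, max_passes=1000, seed=0):
    """examples: list of (inputs, target) pairs with target 0 or 1.
    Returns (weights, k), where k is the number of weight updates made."""
    rng = random.Random(seed)
    n = len(examples[0][0])
    w = [rng.uniform(-1.0, 1.0) for _ in range(n + 1)]     # steps 1 and 2
    for k in range(max_passes):
        s = [0.0] * (n + 1)
        misclassified = 0
        for inputs, target in examples:                     # step 3
            x = [1.0] + list(inputs)
            if output(w, x) != target:
                misclassified += 1
                # Step 5: add x if the unit wrongly failed to fire,
                # add -x if it wrongly fired.
                sign = 1.0 if target == 1 else -1.0
                s = [si + sign * xi for si, xi in zip(s, x)]
        if misclassified == 0:                              # step 4
            return w, k
        w = [wi + eta * si for wi, si in zip(w, s)]         # steps 5 and 6
    raise RuntimeError("no separating weights found within max_passes")

# A linearly separable toy problem (logical OR), chosen for illustration.
OR_EXAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, passes = train_perceptron(OR_EXAMPLES)
print(weights, passes)
```

The pass limit is only a safeguard: for linearly separable data the convergence theorem says a solution will be found, while for data that is not linearly separable the loop would otherwise never terminate.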