Chapter 7 (final)

Concept of Learning
Rote Learning
Learning by Taking Advice
Learning in Problem Solving
Learning by Induction
Explanation-Based Learning
Learning Automation
Learning in Neural Networks

Unit 7
Learning

Learning Objectives
After reading this unit you should appreciate the following:

• Concept of Learning

• Learning Automation

• Genetic Algorithm

• Learning by Induction

• Neural Networks

• Learning in Neural Networks

• Back Propagation Network

Concept of Learning

One of the most often heard criticisms of AI is that machines cannot be called intelligent until they
are able to learn to do new things and to adapt to new situations, rather than simply doing as they
are told to do. There can be little question that the ability to adapt to new surroundings and to
solve new problems is an important characteristic of intelligent entities. Can we expect to see
such abilities in programs? Ada Augusta, one of the earliest philosophers of computing, wrote that

The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we
know how to order it to perform.

LEARNING 179

Several AI critics have interpreted this remark as saying that computers cannot learn. In fact, it
does not say that at all. Nothing prevents us from telling a computer how to interpret its inputs in
such a way that its performance gradually improves.

Rather than asking in advance whether it is possible for computers to "learn," it is much more
enlightening to try to describe exactly what activities we mean when we say "learning" and what
mechanisms could be used to enable us to perform those activities. Simon has proposed that
learning denotes changes in the system that are adaptive in the sense that they enable the
system to do the same task or tasks drawn from the same population more efficiently and more
effectively the next time.

As thus defined, learning covers a wide range of phenomena. At one end of the spectrum is skill
refinement. People get better at many tasks simply by practicing. The more you ride a bicycle or
play tennis, the better you get. At the other end of the spectrum lies knowledge acquisition. As we
have seen, many AI programs draw heavily on knowledge as their source of power. Knowledge is
generally acquired through experience and such acquisition is the focus of this chapter.

Knowledge acquisition itself includes many different activities. Simple storing of computed
information, or rote learning, is the most basic learning activity. Many computer programs, e.g.,
database systems, can be said to "learn" in this sense, although most people would not call such
simple storage learning. However, many AI programs are able to improve their performance
substantially through rote-learning techniques, and we will look at one example in depth: the
checker-playing program of Samuel.

Another way we learn is through taking advice from others. Advice taking is similar to rote
learning, but high-level advice may not be in a form simple enough for a program to use directly in
problem solving. The advice may need to be first operationalized.

People also learn through their own problem-solving experience. After solving a complex
problem, we remember the structure of the problem and the methods we used to solve it. The
next time we see the problem, we can solve it more efficiently. Moreover, we can generalize from
our experience to solve related problems more easily. In contrast to advice taking, learning from
problem-solving experience does not usually involve gathering new knowledge that was
previously unavailable to the learning program. That is, the program remembers its experiences
and generalizes from them, but does not add to the transitive closure of its knowledge, in the
sense that an advice-taking program would, i.e., by receiving stimuli from the outside world. In
large problem spaces, however, efficiency gains are critical. Practically speaking, learning can
mean the difference between solving a problem rapidly and not solving it at all. In addition,
programs that learn through problem-solving experience may be able to come up with qualitatively
better solutions in the future.

Another form of learning that does involve stimuli from the outside is learning from examples. We
often learn to classify things in the world without being given explicit rules. For example, adults
can differentiate between cats and dogs, but small children often cannot. Somewhere along the
line, we induce a method for telling cats from dogs - based on seeing numerous examples of
each. Learning from examples usually involves a teacher who helps us classify things by
correcting us when we are wrong. Sometimes, however, a program can discover things without
the aid of a teacher.

AI researchers have proposed many mechanisms for doing the kinds of learning described
above. In this chapter, we discuss several of them. But keep in mind throughout this discussion
that learning is itself a problem-solving process. In fact, it is very difficult to formulate a precise
definition of learning that distinguishes it from other problem-solving tasks.

Rote Learning

When a computer stores a piece of data, it is performing a rudimentary form of learning. After all,
this act of storage presumably allows the program to perform better in the future (otherwise, why
bother?). In the case of data caching, we store computed values so that we do not have to
recompute them later. When computation is more expensive than recall, this strategy can save a
significant amount of time. Caching has been used in AI programs to produce some surprising
performance improvements. Such caching is known as rote learning.

Figure 7.1: Storing Backed-Up Values

We mentioned one of the earliest game-playing programs, Samuel's checkers program. This
program learned to play checkers well enough to beat its creator. It exploited two kinds of
learning: rote learning and parameter (or coefficient) adjustment. Samuel's program used the
minimax search procedure to explore checkers game trees. As is the case with all such
programs, time constraints permitted it to search only a few levels in the tree. (The exact number
varied depending on the situation.) When it could search no deeper, it applied its static evaluation
function to the board position and used that score to continue its search of the game tree. When it
finished searching the tree and propagating the values backward, it had a score for the position
represented by the root of the tree. It could then choose the best move and make it. But it also
recorded the board position at the root of the tree and the backed-up score that had just been
computed for it as shown in Figure 7.1(a).

Now consider a situation as shown in Figure 7.1(b). Instead of using the static evaluation function
to compute a score for position A, the stored value for A can be used. This creates the effect of
having searched an additional several ply since the stored value for A was computed by backing
up values from exactly such a search.
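
The effect described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Samuel's actual program: the toy tree, the position names, the static scores, and the function names (minimax, search_and_learn) are all hypothetical, and stored scores are assumed to be from the same player's point of view as the static evaluation.

```python
def minimax(position, depth, maximizing, successors, static_eval, rote_table):
    """Depth-limited minimax that reuses rote-learned scores at the search frontier."""
    moves = successors(position)
    if depth == 0 or not moves:
        # Prefer a stored backed-up value over the static evaluation function.
        if position in rote_table:
            return rote_table[position]
        return static_eval(position)
    scores = [
        minimax(m, depth - 1, not maximizing, successors, static_eval, rote_table)
        for m in moves
    ]
    return max(scores) if maximizing else min(scores)


def search_and_learn(root, depth, successors, static_eval, rote_table):
    """Search from root, then record the root's backed-up score (the rote-learning step)."""
    score = minimax(root, depth, True, successors, static_eval, rote_table)
    rote_table[root] = score
    return score


# Hypothetical toy game tree: position name -> successor positions.
TREE = {"R": ["B", "C"], "B": ["A", "D"], "C": ["E", "F"], "A": ["G", "H"]}
# Hypothetical static evaluation scores for each position.
STATIC = {"R": 0, "B": 0, "C": 0, "A": 2, "D": 5, "E": 3, "F": 9, "G": 8, "H": 7}


def successors(position):
    return TREE.get(position, [])


def static_eval(position):
    return STATIC[position]


table = {}
# Searching two ply from R with an empty table uses the static score of A.
without_memory = minimax("R", 2, True, successors, static_eval, {})
# An earlier search rooted at A stores A's backed-up score.
search_and_learn("A", 1, successors, static_eval, table)
# The same two-ply search from R now sees A as if it had looked one ply deeper.
with_memory = search_and_learn("R", 2, successors, static_eval, table)
print(without_memory, with_memory)
```

In this toy tree the stored value for A changes the value backed up to the root, which is the sense in which rote learning acts like a deeper search.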

Rote learning of this sort is very simple. It does not appear to involve any sophisticated problem-
solving capabilities. But even it shows the need for some capabilities that will become
increasingly important in more complex learning systems. These capabilities include:

1. Organized Storage of Information: In order for it to be faster to use a stored value than it
would be to recompute it, there must be a way to access the appropriate stored value quickly. In
Samuel's program, this was done by indexing board positions by a few important
characteristics, such as the number of pieces. But as the complexity of the stored
information increases, more sophisticated techniques are necessary.

2. Generalization: The number of distinct objects that might potentially be stored can be very
large. To keep the number of stored objects down to a manageable level, some kind of
generalization is necessary. In Samuel's program, for example, the number of distinct
objects that could be stored was equal to the number of different board positions that can
arise in a game. Only a few simple forms of generalization were used in Samuel's program
to cut down that number. All positions are stored as though White is to move. This cuts the
number of stored positions in half. When possible, rotations along the diagonal are also
combined. Again, though, as the complexity of the learning process increases, so too does
the need for generalization.
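
Both capabilities can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the board encoding (rows of 'w', 'b' and '.'), the names canonical_key and RoteTable, and the convention that scores are stored from the point of view of the side to move, so that no sign change is needed when colors are swapped. It is not a description of Samuel's data structures.

```python
def canonical_key(board, white_to_move):
    """Return the board as though White is to move.

    board is a tuple of strings using 'w'/'W' for White, 'b'/'B' for Black
    and '.' for an empty square (a hypothetical encoding).
    """
    if white_to_move:
        return tuple(board)
    swap = str.maketrans("wbWB", "bwBW")
    # Turn the board around (reverse rows and columns) and swap the colors.
    return tuple(row[::-1].translate(swap) for row in reversed(board))


def piece_count(board):
    """Number of occupied squares, used as a coarse index."""
    return sum(ch != "." for row in board for ch in row)


class RoteTable:
    """Stored scores, indexed first by piece count and then by canonical position."""

    def __init__(self):
        self._buckets = {}

    def store(self, board, white_to_move, score):
        key = canonical_key(board, white_to_move)
        self._buckets.setdefault(piece_count(key), {})[key] = score

    def lookup(self, board, white_to_move):
        key = canonical_key(board, white_to_move)
        return self._buckets.get(piece_count(key), {}).get(key)


table = RoteTable()
table.store(("w.", "bb"), True, 0.25)
# The color-swapped, turned-around position with Black to move maps to the same entry.
mirrored = table.lookup(("ww", ".b"), False)
print(mirrored)
```

In Python the inner dictionary already provides fast lookup, so the piece-count bucket is shown only to mirror the idea of indexing by a few important characteristics.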

At this point, we have begun to see one way in which learning is similar to other kinds of problem
solving. Its success depends on a good organizational structure for its knowledge base.

Student Activity 7.1

Before reading the next section, answer the following questions.

1. Discuss the role of AI in learning.

2. What do you mean by skill refinement and knowledge acquisition?

3. Would it be reasonable to apply the rote learning procedure to chess? Why?

If your answers are correct, then proceed to the next section.

Learning by Taking Advice

A computer can do very little without a program for it to run. When a programmer writes a series
of instructions into a computer, a rudimentary kind of learning is taking place. The programmer is
a sort of teacher, and the computer is a sort of student. After being programmed, the computer is
now able to do something it previously could not. Executing the program may not be such a
simple matter, however. Suppose the program is written in a high-level language like LISP. Some
interpreter or compiler must intervene to change the teacher's instructions into code that the
machine can execute directly.

People process advice in an analogous way. In chess, the advice "fight for control of the center of
the board" is useless unless the player can translate the advice into concrete moves and plans. A
computer program might make use of the advice by adjusting its static evaluation function to
include a factor based on the number of center squares attacked by its own pieces.

Mostow describes a program called FOO, which accepts advice for playing hearts, a card game.
A human user first translates the advice from English into a representation that FOO can
understand. For example, "Avoid taking points" becomes:

(avoid (take-points me) (trick))

FOO must operationalize this advice by turning it into an expression that contains concepts and
actions FOO can use when playing the game of hearts. One strategy FOO can follow is to
UNFOLD an expression by replacing some term by its definition. By UNFOLDing the definition of
avoid, FOO comes up with:

(achieve (not (during (trick) (take-points me))))

FOO considers the advice to apply to the player called "me." Next, FOO UNFOLDs the definition
of trick:

(achieve (not (during
    (scenario
        (each p1 (players) (play-card p1))
        (take-trick (trick-winner)))
    (take-points me))))

In other words, the player should avoid taking points during the scenario consisting of (1) players
playing cards and (2) one player taking the trick. FOO then uses case analysis to determine which
steps could cause one to take points. It rules out step 1 on the basis that it knows of no
intersection of the concepts take-points and play-card. But step 2 could affect taking points, so
FOO UNFOLDs the definition of take-points:

(achieve (not (there-exists c1 (cards-played)
    (there-exists c2 (point-cards)
        (during (take (trick-winner) c1)
                (take me c2))))))

This advice says that the player should avoid taking point-cards during the process of the
trick-winner taking the trick. The question for FOO now is: Under what conditions does (take me c2)
occur during (take (trick-winner) c1)? By using a technique called partial match, FOO hypothesizes
that points will be taken if me = trick-winner and c2 = c1. It transforms the advice into:

(achieve (not (and (have-points (cards-played)) (= (trick-winner) me))))

This means, "Do not win a trick that has points." We have not travelled very far conceptually from
"avoid taking points," but it is important to note that the current vocabulary is one that FOO can
understand in terms of actually playing the game of hearts. Through a number of other
transformations, FOO eventually settles on:

(achieve (=> (and (in-suit-led (card-of me))
                  (possible (trick-has-points)))
             (low (card-of me))))

In other words, when playing a card that is the same suit as the card that was played first, if the
trick possibly contains points, then play a low card. At last, FOO has translated the rather vague
advice "avoid taking points" into a specific, usable heuristic. FOO is able to play a better game of
hearts after receiving this advice. A human can watch FOO play, detect new mistakes, and
correct them through yet more advice, such as "play high cards when it is safe to do so." The
ability to operationalize knowledge is critical for systems that learn from a teacher's advice. It is
also an important component of explanation-based learning.
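
The UNFOLD step used throughout this example, replacing a term by its definition, can be sketched in Python as follows. This is a minimal illustration, not FOO's implementation. Expressions are represented as nested tuples, and the function names (substitute, unfold, to_lisp) and the "?event" and "?scenario" parameter names are hypothetical. The definition given for avoid is inferred from the first transformation shown in this section.

```python
def substitute(expr, bindings):
    """Replace parameter names in expr by the values bound to them."""
    if isinstance(expr, tuple):
        return tuple(substitute(e, bindings) for e in expr)
    return bindings.get(expr, expr)


def unfold(expr, name, params, body):
    """Replace every call (name arg1 arg2 ...) inside expr by the body of its definition."""
    if not isinstance(expr, tuple):
        return expr
    expr = tuple(unfold(e, name, params, body) for e in expr)
    if expr and expr[0] == name and len(expr) == len(params) + 1:
        return substitute(body, dict(zip(params, expr[1:])))
    return expr


def to_lisp(expr):
    """Render a nested tuple in the parenthesized notation used in the text."""
    if isinstance(expr, tuple):
        return "(" + " ".join(to_lisp(e) for e in expr) + ")"
    return str(expr)


# Assumed definition: (avoid ?event ?scenario) = (achieve (not (during ?scenario ?event)))
AVOID_PARAMS = ("?event", "?scenario")
AVOID_BODY = ("achieve", ("not", ("during", "?scenario", "?event")))

advice = ("avoid", ("take-points", "me"), ("trick",))
unfolded = unfold(advice, "avoid", AVOID_PARAMS, AVOID_BODY)
print(to_lisp(unfolded))
```

Operationalization in FOO also relies on case analysis and partial matching, which this sketch does not attempt to model.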

Learning in Problem Solving

In the last section, we saw how a problem solver could improve its performance by taking advice
from a teacher. Can a program get better without the aid of a teacher? It can, by generalizing from
its own experiences.

Learning by Parameter Adjustment

Many programs rely on an evaluation procedure that combines information from several sources
into a single summary statistic. Game-playing programs do this in their static evaluation functions,
in which a variety of factors, such as piece advantage and mobility, are combined into a single
score reflecting the desirability of a particular board position. Pattern classification programs
often combine several features to determine the correct category into which a given stimulus
should be placed. In designing such programs, it is often difficult to know a priori how much weight
should be attached to each feature being used. One way of finding the correct weights is to begin
with some estimate of the correct settings and then to let the program modify the settings on the
basis of its experience. Features that appear to be good predictors of overall success will have
their weights increased, while those that do not will have their weights decreased, perhaps even
to the point of being dropped entirely.

Samuel's checkers program exploited this kind of learning in addition to the rote learning
described above, and it provides a good example of its use. As its static evaluation function, the
program used a polynomial of the form

$c_1 t_1 + c_2 t_2 + \cdots + c_{16} t_{16}$

The t terms are the values of the sixteen features that contribute to the evaluation. The c terms are
the coefficients (weights) that are attached to each of these values. As learning progresses, the
c values will change.
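
A polynomial of this form is simply a weighted sum. The short Python sketch below shows the computation for a hypothetical three-feature case. The function name static_evaluation and the numbers are illustrative assumptions, not values from Samuel's program.

```python
def static_evaluation(coefficients, features):
    """Weighted sum c1*t1 + c2*t2 + ... of feature values."""
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    return sum(c * t for c, t in zip(coefficients, features))


# Hypothetical coefficients and feature values for one board position.
score = static_evaluation([2.0, -1.0, 0.5], [3, 4, 2])
print(score)  # 2.0*3 - 1.0*4 + 0.5*2 = 3.0
```

Learning by parameter adjustment leaves this function unchanged and modifies only the list of coefficients passed to it.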

The most important question in the design of a learning program based on parameter adjustment
is "When should the value of a coefficient be increased and when should it be decreased?" The
second question to be answered is then "By how much should the value be changed?" The simple
answer to the first question is that the coefficients of terms that predicted the final outcome
accurately should be increased, while the coefficients of poor predictors should be decreased. In
some domains, this is easy to do. If a pattern classification program uses its evaluation function to
classify an input and it gets the right answer, then all the terms that predicted that answer should
have their weights increased. But in game-playing programs, the problem is more difficult. The
program does not get any concrete feedback from individual moves. It does not find out for sure
until the end of the game whether it has won. But many moves have contributed to that final
outcome. Even if the program wins, it may have made some bad moves along the way. The
problem of appropriately assigning responsibility to each of the steps that led to a single outcome
is known as the credit assignment problem.

Samuel's program exploits one technique, albeit imperfect, for solving this problem. Assume that
the initial values chosen for the coefficients are good enough that the total evaluation function
produces values that are fairly reasonable measures of the correct score even if they are not as
accurate as we hope to get them. Then this evaluation function can be used to provide feedback
to itself. Move sequences that lead to positions with higher values can be considered good (and
the terms in the evaluation function that suggested them can be reinforced).

Because of the limitations of this approach, however, Samuel's program did two other things, one
of which provided an additional test that progress was being made and the other of which
generated additional nudges to keep the process out of a rut.

When the program was in learning mode, it played against another copy of itself. Only one of the
copies altered its scoring function during the game; the other remained fixed. At the end of the
game, if the copy with the modified function won, then the modified function was accepted.
Otherwise, the old one was retained. If, however, this happened very many times, then some
drastic change was made to the function in an attempt to get the process going in a more
profitable direction.
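
The accept-or-reject scheme just described can be sketched as a simple loop. The sketch below is a deliberate simplification and rests on several assumptions: the challenger's weights are perturbed once before each game rather than adjusted during play, the game itself is replaced by a hypothetical play_game function that compares two weight vectors against a made-up target, and the step sizes and the patience threshold are arbitrary.

```python
import random


def perturb(weights, scale, rng):
    """Return a copy of weights with each entry shifted by a random amount."""
    return [w + rng.uniform(-scale, scale) for w in weights]


def learn_by_self_play(weights, play_game, rounds, patience, rng):
    """Keep a modified scoring function only if it beats the fixed copy.

    After `patience` consecutive failures, make one drastic change instead
    of a small one, to get the process out of a rut.
    """
    champion = list(weights)
    failures = 0
    for _ in range(rounds):
        drastic = failures >= patience
        challenger = perturb(champion, 5.0 if drastic else 0.5, rng)
        if drastic:
            failures = 0
        if play_game(challenger, champion):
            champion = challenger  # the modified function won: accept it
            failures = 0
        else:
            failures += 1          # otherwise the old function is retained
    return champion


# Hypothetical stand-in for a game: the weights closer to TARGET "win".
TARGET = [3.0, -1.0, 2.0]


def distance(weights):
    return sum((w - t) ** 2 for w, t in zip(weights, TARGET))


def play_game(challenger, champion):
    return distance(challenger) < distance(champion)


start = [0.0, 0.0, 0.0]
learned = learn_by_self_play(start, play_game, rounds=200, patience=10,
                             rng=random.Random(0))
print(distance(start), distance(learned))
```

Because a challenger is accepted only when it wins, the retained weights never get worse under this stand-in game, which also makes the hill-climbing character of the procedure visible.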

Periodically, one term in the scoring function was eliminated and replaced by another. This was
possible because, although the program used only sixteen features at any one time, it actually
knew about thirty-eight. This replacement differed from the rest of the learning procedure since it
created a sudden change in the scoring function rather than a gradual shift in its weights.

This process of learning by successive modifications to the weights of terms in a scoring function
has many limitations, mostly arising out of its lack of exploitation of any knowledge about the
structure of the problem with which it is dealing and the logical relationships among the problem's
components. In addition, because the learning procedure is a variety of hill climbing, it suffers
from the same difficulties as other hill-climbing programs. Parameter adjustment is certainly
not a solution to the overall learning problem. But it is often a useful technique, either in situations
where very little additional knowledge is available or in programs in which it is combined with
more knowledge-intensive methods.

Learning with Macro-Operators

We saw that rote learning was used in the context of a checker-playing program. Similar
techniques can be used in more general problem-solving programs. The idea is the same: to
avoid expensive recomputation. For example, suppose you are faced with the problem of getting
to the downtown post office. Your solution may involve getting in your car, starting it, and driving
along a certain route. Substantial planning may go into choosing the appropriate route, but you
need not plan about how to go about starting your car. You are free to treat START-CAR as an
atomic action, even though it really consists of several actions: sitting down, adjusting the mirror,
inserting the key, and turning the key. Sequences of actions that can be treated as a whole are
called macro-operators.

Macro-operators were used in the early problem-solving system STRIPS. After each
problem-solving episode, the learning component takes the computed plan and stores it away as a
macro-operator, or MACROP. A MACROP is just like a regular operator except that it consists of a
sequence of actions, not just a single one. A MACROP's preconditions are the initial conditions of
the problem just solved, and its postconditions correspond to the goal just achieved. In its simplest
form, the caching of previously computed plans is similar to rote learning.

Suppose we are given an initial blocks world situation in which ON(C, B) and ON(A, Table) are
both true. STRIPS can achieve the goal ON(A, B) by devising a plan with the four steps
UNSTACK(C, B), PUTDOWN(C), PICKUP(A), STACK(A, B). STRIPS now builds a MACROP
with preconditions ON(C, B), ON(A, Table) and postconditions ON(C, Table), ON(A, B). The body
of the MACROP consists of the four steps just mentioned. In future planning, STRIPS is free to
use this complex macro-operator just as it would use any other operator.
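
The construction of this MACROP can be sketched in Python. The sketch makes strong simplifying assumptions: states contain only ON and HOLDING facts (no CLEAR or ARMEMPTY conditions), each operator needs one fact and produces one fact, and the names apply_step, Macrop and build_macrop are hypothetical. It is meant to show how the preconditions, postconditions and body fit together, not how STRIPS itself represents operators.

```python
from dataclasses import dataclass


def apply_step(state, step):
    """Apply one ground operator to a frozenset of facts and return the new state."""
    name, *args = step
    if name == "UNSTACK":
        needed, gained = ("ON", args[0], args[1]), ("HOLDING", args[0])
    elif name == "PUTDOWN":
        needed, gained = ("HOLDING", args[0]), ("ON", args[0], "Table")
    elif name == "PICKUP":
        needed, gained = ("ON", args[0], "Table"), ("HOLDING", args[0])
    elif name == "STACK":
        needed, gained = ("HOLDING", args[0]), ("ON", args[0], args[1])
    else:
        raise ValueError(f"unknown operator {name}")
    if needed not in state:
        raise ValueError(f"{step} is not applicable")
    return (state - {needed}) | {gained}


@dataclass(frozen=True)
class Macrop:
    preconditions: frozenset   # initial conditions of the solved problem
    postconditions: frozenset  # conditions that hold after the whole plan
    body: tuple                # the stored sequence of steps


def build_macrop(initial, plan):
    """Run the plan from the initial state and package the result as a MACROP."""
    state = frozenset(initial)
    for step in plan:
        state = apply_step(state, step)
    return Macrop(frozenset(initial), state, tuple(plan))


initial = {("ON", "C", "B"), ("ON", "A", "Table")}
plan = [("UNSTACK", "C", "B"), ("PUTDOWN", "C"), ("PICKUP", "A"), ("STACK", "A", "B")]
macrop = build_macrop(initial, plan)
print(sorted(macrop.postconditions))
```

Under these assumptions the computed postconditions are ON(C, Table) and ON(A, B), matching the MACROP described above.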

But rarely will STRIPS see the exact same problem twice. New problems will differ from previous
problems. We would still like the problem solver to make efficient use of the knowledge it gained
from its previous experiences. By generalizing MACROPs before storing them, STRIPS is able to
accomplish this. The simplest idea for generalization is to replace all of the constants in the
macro-operator by variables. Instead of storing the MACROP described in the previous
paragraph, STRIPS can generalize the plan to consist of the steps UNSTACK(X1, X2),
PUTDOWN(X1), PICKUP(X3), STACK(X3, X2), where X1, X2, and X3 are variables. This plan can
then be stored with preconditions ON(X1, X2), ON(X3, Table) and postconditions ON(X1, Table),
ON(X3, X2). Such a MACROP can now apply in a variety of situations.
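
Replacing constants by variables is a mechanical substitution, sketched below in Python. The function name generalize and the tuple encoding of steps and facts are assumptions made for illustration. Variables are numbered in order of first appearance in the plan, so C becomes X1, B becomes X2 and A becomes X3, and the goal ON(A, B) therefore becomes ON(X3, X2) under this mapping. Table is kept as a constant.

```python
def generalize(plan, preconditions, postconditions, keep=("Table",)):
    """Replace each constant (except those in keep) by a variable X1, X2, ..."""
    mapping = {}

    def var_for(term):
        if term in keep:
            return term
        if term not in mapping:
            mapping[term] = f"X{len(mapping) + 1}"
        return mapping[term]

    def rewrite(literals):
        return [(lit[0],) + tuple(var_for(a) for a in lit[1:]) for lit in literals]

    # Rewrite the plan first so that variables are numbered in plan order.
    general_plan = rewrite(plan)
    general_pre = rewrite(preconditions)
    general_post = rewrite(postconditions)
    return general_plan, general_pre, general_post, mapping


ground_plan = [("UNSTACK", "C", "B"), ("PUTDOWN", "C"), ("PICKUP", "A"), ("STACK", "A", "B")]
ground_pre = [("ON", "C", "B"), ("ON", "A", "Table")]
ground_post = [("ON", "C", "Table"), ("ON", "A", "B")]

g_plan, g_pre, g_post, g_map = generalize(ground_plan, ground_pre, ground_post)
print(g_plan)
print(g_pre, g_post)
```

This naive substitution treats every constant as replaceable, which is exactly the assumption that can fail when an operator refers to a specific block.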


Generalization is not always so easy. Sometimes constants must retain their specific values. Suppose
our domain included an operator called STACK-ON-B(X), with preconditions that both X and B
be clear, and with postcondition ON(X, B).

STRIPS might come up with the plan UNSTACK(C, B), PUTDOWN(C), STACK-ON-B(A). Let's
generalize this plan and store it as a MACROP. The precondition becomes ON(X3, X2), the
postcondition becomes ON(X1, X2), and the plan itself becomes UNSTACK(X3, X2),
PUTDOWN(X3), STACK-ON-B(X1). Now, suppose we encounter a slightly different problem, in
which block E is on block C, block B is not clear, and the goal is ON(A, C):

The generalized MACROP we just stored seems well-suited to solving this problem if we let X1 =
A, X2 = C, and X3 = E. Its preconditions are satisfied, so we construct the plan UNSTACK(E, C),
PUTDOWN(E), STACK-ON-B(A). But this plan does not work. The problem is that the
postcondition of the MACROP is overgeneralized. This operation is only useful for stacking blocks
onto B, which is not what we need in this new example. In this case, this difficulty will be
discovered when the last step is attempted. Although we cleared C, which is where we wanted to
put A, we failed to clear B, which is where the MACROP is going to try to put it. Since B is not
clear, STACK-ON-B cannot be executed. If B had happened to be clear, the MACROP would
have executed to completion, but it would not have accomplished the stated goal.
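
Both failure modes can be reproduced with a small simulation. The Python sketch below is an illustration under stated assumptions: the operator definitions are simplified add and delete lists, the function names operator and execute are hypothetical, and the concrete initial states (including a block D resting on B) are invented to match the verbal description, since the exact configuration is not given in the text.

```python
def operator(step):
    """Return (preconditions, delete list, add list) for a ground step."""
    name, *args = step
    x = args[0]
    if name == "UNSTACK":
        y = args[1]
        return ({("ON", x, y), ("CLEAR", x)},
                {("ON", x, y), ("CLEAR", x)},
                {("HOLDING", x), ("CLEAR", y)})
    if name == "PUTDOWN":
        return ({("HOLDING", x)},
                {("HOLDING", x)},
                {("ON", x, "Table"), ("CLEAR", x)})
    if name == "STACK-ON-B":
        return ({("CLEAR", x), ("CLEAR", "B")},
                {("CLEAR", "B"), ("ON", x, "Table")},
                {("ON", x, "B")})
    raise ValueError(f"unknown operator {name}")


def execute(state, plan):
    """Run plan; return (number of the first inapplicable step or None, resulting state)."""
    state = set(state)
    for index, step in enumerate(plan, start=1):
        pre, delete, add = operator(step)
        if not pre <= state:
            return index, state
        state = (state - delete) | add
    return None, state


# Instantiated MACROP with X1 = A, X2 = C, X3 = E.
plan = [("UNSTACK", "E", "C"), ("PUTDOWN", "E"), ("STACK-ON-B", "A")]
goal = ("ON", "A", "C")

# Assumed situation 1: a hypothetical block D sits on B, so B is not clear.
b_blocked = {("ON", "E", "C"), ("ON", "C", "Table"), ("ON", "D", "B"), ("ON", "B", "Table"),
             ("ON", "A", "Table"), ("CLEAR", "E"), ("CLEAR", "D"), ("CLEAR", "A")}
# Assumed situation 2: the same, but B happens to be clear.
b_clear = {("ON", "E", "C"), ("ON", "C", "Table"), ("ON", "B", "Table"),
           ("ON", "A", "Table"), ("CLEAR", "E"), ("CLEAR", "B"), ("CLEAR", "A")}

failed_step, _ = execute(b_blocked, plan)
completed, final_state = execute(b_clear, plan)
print(failed_step, completed, goal in final_state)
```

In the first situation the plan stops at its third step. In the second it runs to completion but leaves A on B, so the goal ON(A, C) is still not achieved.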

In reality, STRIPS uses a more complex generalization procedure. First, all constants are
replaced by variables. Then, for each operator in the parameterized plan, STRIPS reevaluates its
preconditions. In our example, the preconditions of steps 1 and 2 are satisfied, but the only way to
ensure that B is clear for step 3 is to assume that block X2, which was cleared by the UNSTACK
operator, is actually block B. Through "re-proving" that the generalized plan works, STRIPS
locates constraints of this kind.

It turns out that the set of problems for which macro-operators are critical is exactly the set of
problems with nonserializable subgoals. Nonserializability means that working on one subgoal will
necessarily interfere with the previous solution to another subgoal. Recall that we discussed such
problems in connection with nonlinear planning. Macro-operators can be useful in such cases,
since one macro-operator can produce a small global change in the world, even though the
individual operators that make it up produce many undesirable local changes.

For example, consider the 8-puzzle. Once a program has correctly placed the first four tiles, it is
difficult to place the fifth tile without disturbing the first four. Because disturbing previously solved
subgoals is detected as a bad thing by heuristic scoring functions, it is strongly resisted. For
many problems, including the 8-puzzle and Rubik's cube, weak methods based on heuristic
scoring are therefore insufficient. Hence, we either need domain-specific knowledge, or else a
new weak method. Fortunately, we can learn the domain-specific knowledge we need in the form
of macro-operators. Thus, macro-operators can be viewed as a weak method for learning. In the
8-puzzle, for example, we might have a macro (a complex, prestored sequence of operators) for
placing the fifth tile without disturbing any of the first four tiles externally (although in fact they are
disturbed within the macro itself). Korf gives an algorithm for learning a complete set of such
macro-operators. This approach contrasts with STRIPS, which learned its
MACROPs gradually, from experience. Korf's algorithm runs in time proportional to the time it
takes to solve a single problem without macro-operators.

Learning by Chunking

Chunking is a process similar in flavor to macro-operators. The idea of chunking comes from the
psychological literature on memory and problem solving. SOAR also exploits chunking so that its
performance can increase with experience. In fact, the designers of SOAR hypothesize that
chunking is a universal learning method, i.e., it can account for all types of learning in intelligent
systems.

SOAR solves problems by firing productions, which are stored in long-term memory. Some of
those firings turn out to be more useful than others. When SOAR detects a useful sequence of
production firings, it creates a chunk, which is essentially a large production that does the work of
an entire sequence of smaller ones. As in MACROPs, chunks are generalized before they are
stored.

Problems like choosing which subgoals to tackle and which operators to try (i.e., search control
problems) are solved with the same mechanisms as problems in the original problem space.
Because the problem solving is uniform, chunking can be used to learn general search control
knowledge in addition to operator sequences. For example, if SOAR tries several different
operators, but only one leads to a useful path in the search space, then SOAR builds productions
that help it choose operators more wisely in the future. SOAR has used chunking to replicate the
macro-operator results described in the last section. In solving the 8-puzzle, for example, SOAR
learns how to place a given tile without permanently disturbing the previously placed tiles. Given
the way that SOAR learns, several chunks may encode a single macro-operator, and one chunk
may participate in a number of macro sequences. Chunks are generally applicable toward any
goal state. This contrasts with macro tables, which are structured towards reaching a particular
goal state from any initial state. Also, chunking emphasizes how learning can occur during
problem solving, while macro tables are usually built during a preprocessing stage. As a result,
SOAR is able to learn within trials as well as across trials. Chunks learned during the initial stages
of solving a problem are applicable in the later stages of the same problem-solving episode. After
a solution is found, the chunks remain in memory, ready for use in the next problem.

The price that SOAR pays for this generality and flexibility is speed. At present, chunking is
inadequate for duplicating the contents of large, directly computed macro-operator tables.

The Utility Problem

PRODIGY employs several learning mechanisms. One mechanism uses explanation-based learning
(EBL), a learning method we discuss in the section on Explanation-Based Learning. PRODIGY can
examine a trace of its own problem-solving behavior and try to explain why certain paths failed. The
program uses those explanations to formulate control rules that help the problem solver avoid those
paths in the future. So while SOAR learns primarily from examples of successful problem solving,
PRODIGY also learns from its failures.

A major contribution of the work on EBL in PRODIGY was the identification of the utility problem in
learning systems. While new search control knowledge can be of great benefit in solving future
problems efficiently, there are also some drawbacks. The learned control rules can take up large
amounts of memory and the search program must take the time to consider each rule at each
step during problem solving. Considering a control rule amounts to seeing if its postconditions are
desirable and seeing if its preconditions are satisfied. This is a time-consuming process. So while
learned rules may reduce problem-solving time by directing the search more carefully, they may
also increase problem-solving time by forcing the problem solver to consider them. If we only
want to minimize the number of node expansions in the search space, then the more control rules
we learn, the better. But if we want to minimize the total CPU time required to solve a problem,
we must consider this trade-off.

PRODIGY maintains a utility measure for each control rule. This measure takes into account the
average savings provided by the rule, the frequency of its application, and the cost of matching it.
If a proposed rule has a negative utility, it is discarded (or "forgotten"). If not, it is placed in long-
term memory with the other rules. It is then monitored during subsequent problem solving. If its
utility falls, the rule is discarded. Empirical experiments have demonstrated the effectiveness of
keeping only those control rules with high utility. Utility considerations apply to a wide range of AI
learning systems. For example, the same concerns arise in dealing with large, expensive chunks in
SOAR.
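
A utility measure of this kind can be sketched as follows. The text says only that the measure takes into account average savings, application frequency and match cost. The particular formula used here (savings times frequency, minus match cost) is an assumption chosen for illustration, as are the class name ControlRule, the function prune and all the numbers.

```python
from dataclasses import dataclass


@dataclass
class ControlRule:
    name: str
    avg_savings: float            # average search time saved when the rule applies
    application_frequency: float  # fraction of match attempts in which it applies
    avg_match_cost: float         # average cost of testing the rule's preconditions

    def utility(self):
        # Assumed combination: expected savings per attempt minus the cost of matching.
        return self.avg_savings * self.application_frequency - self.avg_match_cost


def prune(rules):
    """Discard ("forget") every rule whose utility is negative."""
    return [rule for rule in rules if rule.utility() >= 0]


rules = [
    ControlRule("prefer-clear-block", avg_savings=10.0, application_frequency=0.5,
                avg_match_cost=2.0),
    ControlRule("rarely-useful", avg_savings=4.0, application_frequency=0.1,
                avg_match_cost=1.0),
]
kept = prune(rules)
print([rule.name for rule in kept])
```

Re-running prune as the statistics are updated during later problem solving corresponds to monitoring a rule and discarding it when its utility falls.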

Learning by Induction

Classification is the process of assigning, to a particular input, the name of a class to which it
belongs. The classes from which the classification procedure can choose can be described in a
variety of ways. Their definition will depend on the use to which they will be put.

Classification is an important component of many problem-solving tasks. In its simplest form, it is
presented as a straightforward recognition task. An example of this is the question "What letter of
the alphabet is this?" But often classification is embedded inside another operation. To see how
this can happen, consider a problem-solving system that contains the following production rule:

If: the current goal is to get from place A to place B, and
    there is a WALL separating the two places
then: look for a DOORWAY in the WALL and go through it.

To use this rule successfully, the system's matching routine must be able to identify an object as
a wall. Without this, the rule can never be invoked. Then, to apply the rule, the system must be
able to recognize a doorway.

Before classification can be done, the classes it will use must be defined. This can be done in a
variety of ways, including:

Isolate a set of features that are relevant to the task domain. Define each class by a weighted
sum of values of these features. Each class is then defined by a scoring function that looks very
similar to the scoring functions often used in other situations, such as game playing. Such a
function has the form

$c_1 t_1 + c_2 t_2 + c_3 t_3 + \cdots$

Each t corresponds to a value of a relevant parameter, and each c represents the weight to be
attached to the corresponding t. Negative weights can be used to indicate features whose
presence usually constitutes negative evidence for a given class.


For example, if the task is weather prediction, the parameters can be such measurements as
rainfall and location of cold fronts. Different functions can be written to combine these parameters
to predict sunny, cloudy, rainy, or snowy weather.
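
Classification with one scoring function per class can be sketched in Python. The feature names, the weights and the function name classify below are all hypothetical and chosen only to illustrate the idea, including the use of negative weights for features that count as evidence against a class. They are not taken from any real weather model.

```python
def classify(features, class_weights):
    """Score the input under every class and return (best class, all scores)."""
    scores = {
        label: sum(weight * features[name] for name, weight in weights.items())
        for label, weights in class_weights.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical weights: rainfall supports "rainy" and counts against "sunny".
CLASS_WEIGHTS = {
    "rainy": {"rainfall_mm": 1.0, "cold_front_distance_km": -0.01},
    "sunny": {"rainfall_mm": -1.0, "cold_front_distance_km": 0.02},
}

observation = {"rainfall_mm": 12.0, "cold_front_distance_km": 50.0}
label, scores = classify(observation, CLASS_WEIGHTS)
print(label)
```

Each entry of CLASS_WEIGHTS plays the role of the coefficients in the scoring function above, so learning a class definition of this kind reduces to adjusting those coefficients.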

Isolate a set of features that are relevant to the task domain. Define each class as a structure
composed of those features. For example, if the task is to identify animals, the body of each type
of animal can be stored as a structure, with various features representing such things as color,
length of neck, and feathers.

There are advantages and disadvantages to each of these general approaches. The statistical
approach taken by the first scheme presented here is often more efficient than the structural
approach taken by the second. But the second is more flexible and more extensible.

Regardless of the way that classes are to be described, it is often difficult to construct, by hand,
good class definitions. This is particularly true in domains that are not well understood or that
change rapidly. Thus the idea of producing a classification program that can evolve its own class
definitions is appealing. This task of constructing class definitions is called concept learning, or
induction. The techniques used for this task must, of course, depend on the way that classes
(concepts) are described. If classes are described by scoring functions, then concept learning can
be done using the technique of coefficient adjustment. If, however, we want to define classes
structurally, some other technique for learning class definitions is necessary. In this section, we
present three such techniques.

Winston's Learning Program

Winston describes an early structural concept-learning program. This program operated in a
simple blocks world domain. Its goal was to construct representations of the definitions of
concepts in the blocks domain. For example, it learned the concepts: House, Tent, and Arch shown
in Figure 7.2. The figure also shows an example of a near miss for each concept. A near miss is an
object that is not an instance of the concept in question but that is very similar to such instances.

The program started with a line drawing of a blocks world structure. It used procedures to analyze
the drawing and construct a semantic net representation of the structural description of the
object(s). This structural description was then provided as input to the learning program. An
example of such a structural description for the House of Figure 7.2 is shown in Figure 7.3(a). Node
A represents the entire structure, which is composed of two parts: node B, a Wedge, and node C, a
Brick. Figures 7.3(b) and 7.3(c) show descriptions of the two Arch structures of Figure 7.2. These
descriptions are identical except for the types of the objects on the top: one is a Brick while the
other is a Wedge. Notice that the two supporting objects are related not only by left-of and right-of
links, but also by a does-not-marry link, which says that the two objects do not marry. Two objects
marry if they have faces that touch, and they have a common edge. The marry relation is critical in
the definition of an Arch. It is the difference between the first arch structure and the near miss
arch structure shown in Figure 7.2.

Figure 7.2: Some Blocks World Concepts

The basic approach that Winston's program took to the problem of concept formation can be
described as follows:

1. Begin with a structural description of one known instance of the concept. Call that description
the concept definition.

2. Examine descriptions of other known instances of the concept. Generalize the definition to
include them.

3. Examine descriptions of near misses of the concept. Restrict the definition to exclude these.

Steps 2 and 3 of this procedure can be interleaved.

Steps 2 and 3 of this procedure rely heavily on a comparison process by which similarities and
differences between structures can be detected. This process must function in much the same
way as does any other matching process, such as one to determine whether a given production
rule can be applied to a particular problem state. Because differences as well as similarities must
be found, the procedure must perform not just literal but also approximate matching. The output
of the comparison procedure is a skeleton structure describing the commonalities between the
two input structures. It is annotated with a set of comparison notes that describe specific
similarities and differences between the inputs.

To see how this approach works, we trace it through the process of learning what an arch is.
Suppose that the arch description of Figure 7.3(b) is presented first. It then becomes the definition
of the concept Arch. Then suppose that the arch description of Figure 7.3(c) is presented. The
comparison routine will return a structure similar to the two input structures except that it will note
that the objects represented by the nodes labeled C are not identical. This structure is shown as
Figure 7.4. The c-note link from node C describes the difference found by the comparison routine.
It notes that the difference occurred in the isa link, and that in the first structure the isa link pointed
to Brick, and in the second it pointed to Wedge. It also notes that if we were to follow isa links from
Brick and Wedge, these links would eventually merge. At this point, a new description of the
concept Arch can be generated. This description could say simply that node C must be either a
Brick or a Wedge. But since this particular disjunction has no previously known significance, it is
probably better to trace up the isa hierarchies of Brick and Wedge until they merge. Assuming that
that happens at the node Object, the Arch definition shown in Figure 7.5 can be built.

Figure 7.3: The Structural Description

Figure 7.4

Figure 7.5: The Arch Description after Two Examples

Next, suppose that the near miss arch is presented. This time, the comparison routine will note
that the only difference between the current definition and the near miss is in the does-not-marry link
between nodes B and D. But since this is a near miss, we do not want to broaden the definition to
include it. Instead, we want to restrict the definition so that it is specifically excluded. To do this,
we modify the link does-not-marry, which may simply be recording something that has happened by
chance to be true of the small number of examples that have been presented. It must now say
must-not-marry. Actually, must-not-marry should not be a completely new link. There must be some
structure among link types to reflect the relationships between marry, does-not-marry, and
must-not-marry.
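The trace above can be imitated with a small Python sketch. It is not Winston's program: descriptions are reduced to sets of (node, link, node) triples, the isa hierarchy is assumed to contain only Brick, Wedge, and Object, the supporting nodes B and D are assumed to be Bricks, and the function names are hypothetical.

```python
# Assumed isa hierarchy: Brick and Wedge merge at Object, as in the text.
ISA_PARENT = {"Brick": "Object", "Wedge": "Object", "Object": None}


def ancestors(cls):
    """The class itself followed by everything above it in the isa hierarchy."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = ISA_PARENT.get(cls)
    return chain


def common_ancestor(a, b):
    """First class at which the isa chains of a and b merge (None if they never do)."""
    above_b = set(ancestors(b))
    return next((cls for cls in ancestors(a) if cls in above_b), None)


def generalize(model, example):
    """Positive example: where isa links differ, climb to the merge point."""
    example_isa = {s: o for s, link, o in example if link == "isa"}
    result = set()
    for s, link, o in model:
        if link == "isa" and s in example_isa and example_isa[s] != o:
            o = common_ancestor(o, example_isa[s]) or o
        result.add((s, link, o))
    return result


def satisfied(triple, description):
    """A model link holds if it is present, or if an isa link is subsumed."""
    s, link, o = triple
    if link == "isa":
        return any(s2 == s and l2 == "isa" and o in ancestors(o2)
                   for s2, l2, o2 in description)
    return triple in description


def specialize(model, near_miss):
    """Near miss: the single link it lacks becomes a must- link in the model."""
    missing = [t for t in model if not satisfied(t, near_miss)]
    if len(missing) != 1:          # no single difference: learn nothing
        return set(model)
    s, link, o = missing[0]
    strict = "must-" + (link[len("does-"):] if link.startswith("does-") else link)
    return (set(model) - {missing[0]}) | {(s, strict, o)}


common = {("B", "isa", "Brick"), ("D", "isa", "Brick"),
          ("B", "left-of", "D"), ("D", "right-of", "B")}
arch_brick_top = common | {("C", "isa", "Brick"), ("B", "does-not-marry", "D")}
arch_wedge_top = common | {("C", "isa", "Wedge"), ("B", "does-not-marry", "D")}
near_miss_arch = common | {("C", "isa", "Brick"), ("B", "marry", "D")}

model = generalize(arch_brick_top, arch_wedge_top)
model = specialize(model, near_miss_arch)
print(sorted(model))
```

After the two steps the model requires node C to be an Object and carries a must-not-marry link between B and D, which mirrors the outcome described in the text.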

Notice how the problem-solving and knowledge representation techniques we covered in earlier
chapters are brought to bear on the problem of learning. Semantic networks were used to
describe block structures, and an isa hierarchy was used to describe relationships among already
known objects. A matching process was used to detect similarities and differences between
structures, and hill climbing allowed the program to evolve a more and more accurate concept
definition.

This approach to structural concept learning is not without its problems. One major problem is
that a teacher must guide the learning program through a carefully chosen sequence of
examples. In the next section, we explore a learning technique that is insensitive to the order in
which examples are presented.

Example

Near Misses:


Goal

Winston's Arch Learning Program models how we humans learn a concept such as "What is an
Arch?"

This is done by presenting to the "student" (the program) examples and counterexamples (which
are, however, very similar to the examples). The latter are called "near misses."

The program builds (Winston's version of) a semantic network. With every example and with
every near miss, the semantic network gets better at predicting what exactly an arch is.

Learning Procedure W

1. Let the description of the first sample, which must be an example, be the initial description.

2. For all other samples:

- If the sample is a near miss, call SPECIALIZE.

- If the sample is an example, call GENERALIZE.

Specialize

• Establish a matching between parts of the near miss and parts of the model.

• IF there is a single most important difference THEN:

– If the model has a link that is not in the near miss, convert it into a MUST link.

– If the near miss has a link that is not in the model, create a MUST-NOT link.

• ELSE ignore the near miss.

Generalize

1. Establish a matching between parts of the sample and the model.

2. For each difference:

– If the link in the model points to a super/sub-class of the link in the sample, a MUST-BE
link is established at the join point.

– If the links point to classes of an exhaustive set, drop the link from the model.

– Otherwise create a new common class and a MUST-BE arc to it.

– If a link is missing in model or sample, drop it in the other.

– If numbers are involved, create an interval that contains both numbers.

– Otherwise, ignore the difference.

Because the only difference is the missing support, and the sample is a near miss, that arc must be
important.

New Model


This uses SPECIALIZE. (Actually, it uses it twice; this is an exception.)

However, the following principles may be applied:

1. The teacher must use near misses and examples judiciously.

2. You cannot learn if you cannot know (=> KR).

3. You cannot know if you cannot distinguish relevant from irrelevant features. (=> Teachers
can help here, by using near misses.)

4. No-Guessing Principle: When there is doubt what to learn, learn nothing.

5. Learning Conventions called "Felicity Conditions" help. The teacher knows that there should
be only one relevant difference, and the Student uses it.

6. No-Altering Principle: If an example fails the model, create an exception. (Penguins are
birds).

7. Martin's Law: You can't learn anything unless you almost know it already.

Learning Rules from Examples

Given are 8 instances:

1. DECCG

2. AGDBC

3. GCFDC

4. CDFDE

5. CEGEC
(Comment: "These are good")

1. CGCGF

2. DGFCD

3. ECGCD
(Comment: "These are bad")

The following rule perfectly describes the warnings, where M(i, X) is read as "the letter at
position i is X":

(M(1,C) ∧ M(2,G) ∧ M(3,C) ∧ M(4,G) ∧ M(5,F)) ∨

(M(1,D) ∧ M(2,G) ∧ M(3,F) ∧ M(4,C) ∧ M(5,D)) ∨

(M(1,E) ∧ M(2,C) ∧ M(3,G) ∧ M(4,C) ∧ M(5,D)) -> Warning

Now we eliminate the first conjunct from all 3 disjuncts:

(M(2,G) ∧ M(3,C) ∧ M(4,G) ∧ M(5,F)) ∨

(M(2,G) ∧ M(3,F) ∧ M(4,C) ∧ M(5,D)) ∨

(M(2,C) ∧ M(3,G) ∧ M(4,C) ∧ M(5,D)) -> Warning

Now we check whether any of the good strings 1 - 5 would cause a warning.

Strings 1, 4, and 5 are out at the first remaining position (originally the second). Strings 2 and 3
are out at the next (third) position.

Thus, we don't really need the first conjunct in the Warning rule!

We keep repeating this dropping step until we are left with

M(5,F) ∨ M(5,D) -> Warning


This rule contains all the knowledge contained in the examples!

Note that this elimination approach is order dependent! Working from right to left you would get:

(M(1,C) ∧ M(2,G)) ∨ (M(1,D) ∧ M(2,G)) ∨ M(1,E) -> Warning

There are N! (N factorial) orders in which we could try to delete features! Also, there is no
guarantee that the rules will work correctly on new instances. In fact, rules are not even
guaranteed to exist!

To deal with the N! problem, try to find "higher level features". For instance, if 2 features always
occur together, replace them by one feature.
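The dropping procedure and its order dependence can be checked with a short Python sketch. It assumes one reading of the procedure: each bad string starts as a full conjunction, and a conjunct is dropped from a disjunct whenever the shortened disjunct still matches none of the good strings. The function names are hypothetical.

```python
GOOD = ["DECCG", "AGDBC", "GCFDC", "CDFDE", "CEGEC"]
BAD = ["CGCGF", "DGFCD", "ECGCD"]


def matches(conjunction, string):
    """True if every M(position, letter) in the conjunction holds for the string."""
    return all(string[pos - 1] == letter for pos, letter in conjunction.items())


def prune(bad_string, good_strings, order):
    """Drop conjuncts in the given order while no good string triggers the rule."""
    conjunction = {i + 1: ch for i, ch in enumerate(bad_string)}
    for pos in order:
        trial = {p: c for p, c in conjunction.items() if p != pos}
        if not any(matches(trial, g) for g in good_strings):
            conjunction = trial
    return conjunction


def learn(order):
    """Return the Warning rule as a list of disjuncts (duplicates removed)."""
    rule = []
    for bad in BAD:
        disjunct = prune(bad, GOOD, order)
        if disjunct not in rule:
            rule.append(disjunct)
    return rule


print(learn([1, 2, 3, 4, 5]))   # dropping from left to right
print(learn([5, 4, 3, 2, 1]))   # dropping from right to left
```

Under these assumptions the left-to-right order yields the two disjuncts M(5,F) and M(5,D), while the right-to-left order yields the three-disjunct rule given above, so the two orders do produce different rules from the same eight instances.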

Student Activity 7.2

Before reading next section, answer the following questions.

1. What do you mean by learning by induction? Explain with an example.

2. Describe learning by macro-operators.

3. Explain why a good knowledge representation is useful in reasoning knowledge.

If your answers are correct, then proceed to the next section.

Explanation-Based Learning

The previous section illustrated how we can induce concept descriptions from positive and
negative examples. Learning complex concepts using these procedures typically requires a
substantial number of training instances. But people seem to be able to learn quite a bit from
single examples. Consider a chess player who, as Black, has reached the position shown in
Figure 7.6. The position is called a "fork" because the white knight attacks both the black king and
the black queen. Black must move the king, thereby leaving the queen open to capture. From this
single experience, Black is able to learn quite a bit about the fork trap: the idea is that if any piece
x attacks both the opponent's king and another piece y, then piece y will be lost. We don't need to
see dozens of positive and negative examples of fork positions in order to draw these
conclusions. From just one experience, we can learn to avoid this trap in the future and perhaps
to use it to our own advantage.

What makes such single-example learning possible? The answer, not surprisingly, is knowledge.
The chess player has plenty of domain-specific knowledge that can be brought to bear, including
the rules of chess and any previously acquired strategies. That knowledge can be used to identify
the critical aspects of the training example. In the case of the fork, we know that the double
simultaneous attack is important while the precise position and type of the attacking piece are not.

Much of the recent work in machine learning has moved away from the empirical, data-intensive
approach described in the last section toward this more analytical, knowledge-intensive approach.
A number of independent studies led to the characterization of this approach as explanation-based
learning. An EBL system attempts to learn from a single example x by explaining why x is an
example of the target concept. The explanation is then generalized, and the system's
performance is improved through the availability of this knowledge.

Figure 7.6: A Fork Position in Chess

Mitchell et al. and DeJong and Mooney both describe general frameworks for EBL programs and
give general learning algorithms. We can think of EBL programs as accepting the following as
input:

1. A Training Example: what the learning program "sees" in the world

2. A Goal Concept: a high-level description of what the program is supposed to learn

3. An Operationality Criterion: a description of which concepts are usable

4. A Domain Theory: a set of rules that describe relationships between objects and actions in a
domain


From this, EBL computes a generalization of the training example that is sufficient to describe the
goal concept, and also satisfies the operationality criterion. Let's look more closely at this
specification. The training example is a familiar input; it is the same thing as the example in the
version space algorithm. The goal concept is also familiar, but in previous sections, we have
viewed the goal concept as an output of the program, not an input. The assumption here is that
the goal concept is not operational, just like the high-level card-playing advice. An EBL program
seeks to operationalize the goal concept by expressing it in terms that a problem-solving program
can understand. These terms are given by the operationality criterion. In the chess example, the
goal concept might be something like "bad position for Black," and the operationalized concept
would be a generalized description of situations similar to the training example, given in terms of
pieces and their relative positions. The last input to an EBL program is a domain theory, in our
case, the rules of chess. Without such knowledge, it is impossible to come up with a correct
generalization of the training example.

Explanation-based generalization (EBG) is an algorithm for EBL described in Mitchell et al. It has two
steps: (1) explain and (2) generalize. During the first step, the domain theory is used to prune
away all the unimportant aspects of the training example with respect to the goal concept. What is
left is an explanation of why the training example is an instance of the goal concept. This
explanation is expressed in terms that satisfy the operationality criterion. The next step is to
generalize the explanation as far as possible while still describing the goal concept. Following our
chess example, the first EBL step chooses to ignore White's pawns, king, and rook, and
constructs an explanation consisting of White's knight, Black's king, and Black's queen, each in
their specific positions. Operationality is ensured: all chess-playing programs understand the
basic concepts of piece and position. Next, the explanation is generalized. Using domain
knowledge, we find that moving the pieces to a different part of the board is still bad for Black. We
can also determine that other pieces besides knights and queens can participate in fork attacks.

In reality, current EBL methods run into difficulties in domains as complex as chess, so we will not
pursue this example further. Instead, let's look at a simpler case. Consider the problem of
learning the concept Cup. Unlike the arch-learning program, we want to be able to generalize from
a single example of a cup. Suppose the example is:

Training Example:

owner(Object23, Ralph) ∧ has-part(Object23, Concavity12) ∧

is(Object23, Light) ∧ color(Object23, Brown) ∧ ...

Clearly, some of the features of Object23 are more relevant to its being a cup than others. So far in
this unit we have seen several methods for isolating relevant features. All these methods require
many positive and negative examples. In EBL we instead rely on domain knowledge, such as:

Domain Knowledge:

is(x, Light) ∧ has-part(x, y) ∧ isa(y, Handle) → liftable(x)

has-part(x, y) ∧ isa(y, Bottom) ∧ is(y, Flat) → stable(x)

has-part(x, y) ∧ isa(y, Concavity) ∧ is(y, Upward-Pointing) → open-vessel(x)

We also need a goal concept to operationalize:

Goal Concept: Cup

x is a Cup if x is liftable, stable, and open-vessel.

Operationality Criterion: Concept definition must be expressed in purely structural terms (e.g.,
Light, Flat, etc.).

Given a training example and a functional description, we want to build a general structural
description of a cup. The first step is to explain why Object23 is a cup. We do this by constructing a
proof, as shown in Figure 7.7. Standard theorem-proving techniques can be used to find such a
proof. Notice that the proof isolates the relevant features of the training example; nowhere in the
proof do the predicates owner and color appear. The proof also serves as a basis for a valid
generalization. If we gather up all the assumptions and replace constants with variables, we get
the following description of a cup:

has-part(x, y) ∧ isa(y, Concavity) ∧ is(y, Upward-Pointing) ∧

has-part(x, z) ∧ isa(z, Bottom) ∧ is(z, Flat) ∧

has-part(x, w) ∧ isa(w, Handle) ∧ is(x, Light)

This definition satisfies the operationality criterion and could be used by a robot to classify
objects.
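The explain-then-generalize steps for the cup can be sketched in Python as follows. This is a simplified illustration, not the EBG algorithm of Mitchell et al.: the facts hidden behind the "..." in the training example are assumptions (Handle1 and Bottom1 are hypothetical names), the prover only handles ground derived goals, and generalization simply keeps the constants that occur in the domain rules and turns every other constant into a variable.

```python
def is_var(term):
    return term.startswith("?")


# First four facts are from the text; the rest are ASSUMED stand-ins for "...".
FACTS = [
    ("owner", "Object23", "Ralph"),
    ("has-part", "Object23", "Concavity12"),
    ("is", "Object23", "Light"),
    ("color", "Object23", "Brown"),
    ("isa", "Concavity12", "Concavity"),
    ("is", "Concavity12", "Upward-Pointing"),
    ("has-part", "Object23", "Handle1"),
    ("isa", "Handle1", "Handle"),
    ("has-part", "Object23", "Bottom1"),
    ("isa", "Bottom1", "Bottom"),
    ("is", "Bottom1", "Flat"),
]

# Goal concept and domain knowledge, written as (head, body) pairs.
RULES = [
    (("cup", "?x"), [("liftable", "?x"), ("stable", "?x"), ("open-vessel", "?x")]),
    (("liftable", "?x"),
     [("is", "?x", "Light"), ("has-part", "?x", "?y"), ("isa", "?y", "Handle")]),
    (("stable", "?x"),
     [("has-part", "?x", "?y"), ("isa", "?y", "Bottom"), ("is", "?y", "Flat")]),
    (("open-vessel", "?x"),
     [("has-part", "?x", "?y"), ("isa", "?y", "Concavity"),
      ("is", "?y", "Upward-Pointing")]),
]


def match(pattern, ground, env):
    """Extend env so that pattern equals the ground literal, or return None."""
    if len(pattern) != len(ground):
        return None
    env = dict(env)
    for p, g in zip(pattern, ground):
        p = env.get(p, p)
        if is_var(p):
            env[p] = g
        elif p != g:
            return None
    return env


def prove(goal, env):
    """Yield (env, leaf_facts) for every proof of goal."""
    goal = tuple(env.get(t, t) for t in goal)
    for fact in FACTS:                       # operational: a stored fact
        extended = match(goal, fact, env)
        if extended is not None:
            yield extended, [fact]
    for head, body in RULES:                 # derived: expand a domain rule
        local = match(head, goal, {})
        if local is not None:
            for _, leaves in prove_all(body, local):
                yield env, leaves


def prove_all(goals, env):
    """Prove a conjunction left to right, backtracking over alternatives."""
    if not goals:
        yield env, []
        return
    for env1, leaves1 in prove(goals[0], env):
        for env2, leaves2 in prove_all(goals[1:], env1):
            yield env2, leaves1 + leaves2


def generalize(leaves):
    """Keep constants named in the rules; replace all others by variables."""
    kept = {t for _, body in RULES for lit in body for t in lit[1:] if not is_var(t)}
    names, result = {}, []
    for pred, *args in leaves:
        new_args = [a if a in kept else names.setdefault(a, f"?v{len(names)}")
                    for a in args]
        result.append((pred, *new_args))
    return result


_, explanation = next(prove(("cup", "Object23"), {}))
for literal in generalize(explanation):
    print(literal)
```

The facts about owner and color never enter the explanation, so they do not appear in the generalized description either.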


Figure 7.7: An Explanation

Simply replacing constants by variables worked in this example, but in some cases it is necessary
to retain certain constants. To catch these cases, we must reprove the goal. This process, which
we saw earlier in our discussion of learning in STRIPS, is called goal regression.

As we have seen, EBL depends strongly on a domain theory. Given such a theory, why are
examples needed at all? We could have operationalized the goal concept Cup without reference
to an example, since the domain theory contains all of the requisite information. The answer is
that examples help to focus the learning on relevant operationalizations. Without an example cup,
EBL is faced with the task of characterizing the entire range of objects that satisfy the goal
concept. Most of these objects will never be encountered in the real world, and so the result will
be overly general.

Providing a tractable domain theory is a difficult task. There is evidence that humans do not learn
with very primitive relations. Instead, they create incomplete and inconsistent domain theories.
For example, returning to chess, such a theory might include concepts like "weak pawn
structure." Getting EBL to work in ill-structured domain theories is an active area of research.

EBL shares many features of all the learning methods described in earlier sections. Like concept
learning, EBL begins with a positive example of some concept. As in learning by advice taking,
the goal is to operationalize some piece of knowledge. And EBL techniques, like the techniques
of chunking and macro-operators, are often used to improve the performance of problem-solving
engines. The major difference between EBL and other learning methods is that EBL programs are
built to take advantage of domain knowledge. Since learning is just another kind of problem
solving, it should come as no surprise that there is leverage to be found in knowledge.

Learning Automation

The theory of learning automata was first introduced in 1961 (Tsetlin, 1961). Since that time these
systems have been studied intensely, both analytically and through simulations (Lakshmivarahan,
1981). Learning automata systems are finite set adaptive systems which interact iteratively with a
general environment. Through a probabilistic trial-and-error response process they learn to
choose or adapt to a behaviour that produces the best response. They are, essentially, a form of
weak, inductive learners.

In Figure 7.8, we see that the learning model for learning automata has been simplified to just
two components, an automaton (learner) and an environment. The learning cycle begins with an
input to the learning automata system from the environment. This input elicits one of a finite
number of possible responses, and the environment then provides some form of feedback to the
automaton in return. The automaton uses this feedback to alter its stimulus-response mapping
structure so that its behaviour improves.

As a simple example, suppose a learning automaton is being used to learn the best temperature
control setting for your office each morning. It may select any one of ten temperature range
settings at the beginning of each day (Figure 7.9). Without any prior knowledge of your
temperature preferences, the automaton randomly selects a first setting using the probability
vector corresponding to the temperature settings.

Figure 7.8: Learning Automaton Model


Figure 7.9: Temperature Control Model

Since the probability values are uniformly distributed, any one of the settings will be selected with
equal likelihood. After the selected temperature has stabilized, the environment may respond with
a simple good-bad feedback response. If the response is good, the automaton will modify its
probability vector by rewarding the probability corresponding to the good setting with a positive
increment and reducing all other probabilities proportionately to maintain the sum equal to 1. If the
response is bad, the automaton will penalize the selected setting by reducing the probability
corresponding to the bad setting and increasing all other values proportionately. This process is
repeated each day until the good selections have high probability values and all bad choices have
values near zero. Thereafter, the system will always choose the good settings. If, at some
point in the future, your temperature preferences change, the automaton can easily readapt.
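A minimal Python sketch of this reward and penalty scheme is given below. The update formulas are one possible reading of "proportionately", and the learning rates (0.2 and 0.05), the number of days, and the function names are assumptions chosen for illustration.

```python
import random


def reward(p, i, rate):
    """Raise p[i]; shrink every other entry by the same factor so the sum stays 1."""
    q = [x * (1 - rate) for x in p]
    q[i] = p[i] + rate * (1 - p[i])
    return q


def penalize(p, i, rate):
    """Lower p[i]; give the freed mass to the other entries in proportion to their size."""
    freed = rate * p[i]
    rest = 1 - p[i]
    if rest > 0:
        q = [x + freed * x / rest for x in p]
    else:                                   # p[i] was 1: share the mass equally
        q = [freed / (len(p) - 1)] * len(p)
    q[i] = p[i] - freed
    return q


def simulate(preferred, days=200, n=10, seed=0):
    """Environment answers 'good' only when the preferred setting is chosen."""
    rng = random.Random(seed)
    p = [1.0 / n] * n                       # uniform start: no prior knowledge
    for _ in range(days):
        choice = rng.choices(list(range(n)), weights=p)[0]
        if choice == preferred:
            p = reward(p, choice, 0.2)
        else:
            p = penalize(p, choice, 0.05)
    return p


probabilities = simulate(preferred=4)
print([round(x, 3) for x in probabilities])
```

If the preference changes, the same two update functions keep running; the old setting starts to be penalized and the new one rewarded, which is the readaptation described above.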

Learning automata have been generalized and studied in various ways. One such generalization
has been given the special name of collective learning automata (CLA). CLAs are standard
learning automata systems except that feedback is not provided to the automaton after each
response. In this case, several collective stimulus-response actions occur before feedback is
passed to the automaton. It has been argued (Bock, 1976) that this type of learning more closely
resembles that of human beings in that we usually perform a number or group of primitive actions
before receiving feedback on the performance of such actions, such as solving a complete
problem on a test or parking a car. We illustrate the operation of CLAs with an example of
learning to play the game of Nim in an optimal way.

Nim is a two-person zero-sum game in which the players alternate in removing tokens from an
array that initially has nine tokens. The tokens are arranged into three rows with one token in the
first row, three in the second row, and five in the third row (Figure 7.10).

Figure 7.10: Nim Initial Configuration

The first player must remove at least one token but not more than all the tokens in any single row.
Tokens can only be removed from a single row during each player’s move. The second player
responds by removing one or more tokens remaining in any row. Players alternate in this way
until all tokens have been removed; the loser is the player forced to remove the last token.

We will use the triple (n1, n2, n3) to represent the state of the game at a given time, where n1, n2,
and n3 are the numbers of tokens in rows 1, 2, and 3, respectively. We will also use a matrix to
determine the moves made by the CLA for any given state. The matrix of Figure 7.11 has column
headings which correspond to the state of the game when it is the CLA’s turn to move, and row
headings which correspond to the new game state after the CLA has completed a move.
Fractional entries in the matrix are transition probabilities used by the CLA to execute each of its
moves. Asterisks in the matrix represent invalid moves.

Beginning with the initial state (1, 3, 5), suppose the CLA’s opponent removes two tokens from
the third row resulting in the new state (1, 3, 3). If the CLA then removes all three tokens from the
second row, the resultant state is (1, 0, 3). Suppose the opponent now removes all remaining
tokens from the third row. This leaves the CLA with a losing configuration of (1, 0, 0).

Figure 7.11: CLA Internal Representation of Game States

At the start of the learning sequence, the matrix is initialized such that the elements in each
column are equal (uniform) probability values. For example, since there are eight valid moves
from the state (1, 3, 4), each column element under this state corresponding to a valid move has
been given a probability of 1/8, the uniform value over all valid moves for the given column state.

The CLA selects moves probabilistically using the probability values in each column. So, for
example, if the CLA had the first move, any row intersecting with the first column not containing
an asterisk would be chosen with probability 1/9. This choice then determines the new game
state from which the opponent must select a move. The opponent might have a similar matrix to
record game states and choose moves. A complete game is played before the CLA is given any
feedback, at which time it is informed whether or not its responses were good or bad. This is the
collective feature of the CLA.

If the CLA wins a game, all moves made by the CLA during that game are rewarded by increasing
the probability value in each column corresponding to the winning move. All non-winning
probabilities in those columns are reduced equally to keep the sum in each column equal to 1. If
the CLA loses a game, the moves leading to that loss are penalized by reducing the probability
values corresponding to each losing move. All other probabilities in the columns having a losing
move are increased equally to keep the column totals equal to 1.

After a number of games have been played by the CLA, the matrix elements that correspond to
repeated wins will increase toward one, while all other elements in the column will decrease
toward zero. Consequently, the CLA will choose the winning moves more frequently and thereby
improve its performance.
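The collective scheme can be sketched in Python as follows. Several simplifications are assumed: the matrix is stored as a dictionary of columns that are created on demand, the opponent plays uniformly at random, and after feedback each column is renormalized by division instead of adjusting the other entries by equal amounts (this keeps every probability positive). The class name, method names, and the learning rate are hypothetical.

```python
import random


def moves(state):
    """All states reachable by removing one or more tokens from a single row."""
    result = []
    for row, count in enumerate(state):
        for take in range(1, count + 1):
            nxt = list(state)
            nxt[row] = count - take
            result.append(tuple(nxt))
    return result


class CLA:
    def __init__(self, rate=0.1, rng=None):
        self.rate = rate
        self.rng = rng or random.Random()
        self.table = {}      # state -> {next_state: probability} (one matrix column)
        self.history = []    # (state, chosen next_state) pairs of the current game

    def column(self, state):
        """Return the column for a state, initialized uniformly on first use."""
        if state not in self.table:
            options = moves(state)
            self.table[state] = {m: 1.0 / len(options) for m in options}
        return self.table[state]

    def choose(self, state):
        col = self.column(state)
        nxt = self.rng.choices(list(col), weights=list(col.values()))[0]
        self.history.append((state, nxt))
        return nxt

    def feedback(self, won):
        """Collective feedback: adjust every move of the finished game at once."""
        factor = 1 + self.rate if won else 1 - self.rate
        for state, nxt in self.history:
            col = self.table[state]
            col[nxt] *= factor
            total = sum(col.values())
            for m in col:
                col[m] /= total          # renormalize so the column sums to 1
        self.history.clear()


def play(cla, opponent_rng, cla_first=True):
    """Play one game; the player who removes the last token loses."""
    state, cla_turn = (1, 3, 5), cla_first
    while True:
        if cla_turn:
            state = cla.choose(state)
        else:
            state = opponent_rng.choice(moves(state))
        if sum(state) == 0:
            won = not cla_turn           # whoever just moved took the last token
            cla.feedback(won)
            return won
        cla_turn = not cla_turn


learner = CLA(rate=0.1, rng=random.Random(0))
opponent = random.Random(1)
wins = sum(play(learner, opponent) for _ in range(500))
print("games won out of 500:", wins)
```

Note that moves((1, 3, 5)) returns nine states and moves((1, 3, 4)) returns eight, in agreement with the probabilities 1/9 and 1/8 mentioned in the text.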

Simulated games between a CLA and various types of opponents have been performed and the
results plotted (Bock, 1985). It was shown, for example, that two CLAs playing against each other
required about 300 games before each learned to play optimally. Note, however, that
convergence to optimality can be accomplished with fewer games if the opponent always plays
optimally (or poorly), since, in such a case, the CLA will repeatedly lose (win) and quickly reduce
(increase) the losing (winning) move elements to zero (one). It is also possible to speed up the
learning process through the use of other techniques such as learned heuristics.

Learning systems based on the learning automaton or CLA paradigm are fairly general for
applications in which a suitable state representation scheme can be found. They are also quite
robust learners. In fact, it has been shown that an LA will converge to an optimal distribution
under fairly general conditions if the feedback is accurate with probability greater than 0.5 (Narendra
and Thathachar, 1974). Of course, the rate of convergence is strongly dependent on the reliability
of the feedback.

Learning automata are not very efficient learners, as was noted in the game-playing example
above. They are, however, relatively easy to implement, provided the number of states is not too
large. When the number of states becomes large, the amount of storage and the computation
required to update the transition matrix become excessive.

Potential applications for learning automata include adaptive telephone routing and control. Such
applications have been studied using simulation programs (Narendra et al., 1977).

Genetic Algorithms

Genetic algorithm learning methods are based on models of natural adaptation and evolution.
These learning systems improve their performance through processes that model population
genetics and survival of the fittest. They have been studied since the early 1960s (Holland, 1962,
1975).

In the field of genetics, a population is subjected to an environment that places demands on the
members. The members that adapt well are selected for mating and reproduction. The offspring
of these better performers inherit genetic traits from both their parents. Members of this second
generation of offspring, which also adapt well, are then selected for mating and reproduction and
the evolutionary cycle continues. Poor performers die off without leaving offspring. Good
performers produce good offspring and they, in turn, perform well. After some number of
generations, the resultant population will have adapted optimally or at least very well to the
environment.

Genetic algorithm systems start with a fixed size population of data structures that are used to
perform some given tasks. After requiring the structures to execute the specified tasks some
number of times, the structures are rated on their performance, and a new generation of data
structures is then created. The new generation is created by mating the higher performing
structures to produce offspring. These offspring and their parents are then retained for the next
generation while the poorer performing structures are discarded. The basic cycle is illustrated in
Figure 7.12.

Mutations are also performed on the best performing structures to ensure that the full space of
possible structures is reachable. This process is repeated for a number of generations until the
resultant population consists of only the highest performing structures.


Data structures that make up the population can represent rules or any other suitable types of
knowledge structure. To illustrate the genetic aspects of the problem, assume for simplicity that
the populations of structures are fixed-length binary strings such as the eight-bit string 11010001.
An initial population of these eight-bit strings would be generated randomly or with the use of
heuristics at time zero. These strings, which might be simple condition-action rules, would then
be assigned some tasks to perform (like predicting the weather based on certain physical and
geographic conditions or diagnosing a fault in a piece of equipment).

Figure 7.12: Genetic Algorithm

After multiple attempts at executing the tasks, each of the participating structures would be rated
and tagged with a utility value, u, commensurate with its performance. The next population would
then be generated using the higher performing structures as parents and the process would be
repeated with the newly produced generation. After many generations the remaining population
structures should perform the desired tasks well.

Mating between two strings is accomplished with the crossover operation which randomly selects
a bit position in the eight-bit string and concatenates the head of one parent to the tail of the
second parent to produce the offspring. Suppose the two parents are designated as xxxxxxxx and
yyyyyyyy respectively, and suppose the third bit position has been selected as the crossover point
(at the position of the colon in the structure xxx:xxxxx). After the crossover operation is applied,
two offspring are then generated, namely xxxyyyyy and yyyxxxxx. Such offspring and their
parents are then used to make up the next generation of structures.

A second genetic operation often used is called inversion. Inversion is a transformation applied to
a single string. A bit position is selected at random, and the inversion operation concatenates the
tail of the string to the head of the same string. Thus, if the sixth position were selected
(x1x2x3x4x5x6x7x8), the inverted string would be x7x8x1x2x3x4x5x6.

A third operator, mutation, is used to ensure that all locations of the rule space are reachable, that
is, that every potential rule in the rule space is available for evaluation. This ensures that the
selection process does not get caught in a local minimum. For example, it may happen that use of the
crossover and inversion operators will only produce a set of structures that are better than all
local neighbors but not optimal in a global sense. This can happen since crossover and inversion
may not be able to produce some undiscovered structures. The mutation operator can overcome
this by simply selecting any bit position in a string at random and changing it. This operator is
typically used only infrequently to prevent random wandering in the search space.
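
The three operators described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the use of character strings to hold the bit patterns are assumptions made here, not part of the original description.

```python
import random

def crossover(parent_a, parent_b, point):
    """Join the head of each parent (first `point` symbols) to the tail of the other."""
    child_1 = parent_a[:point] + parent_b[point:]
    child_2 = parent_b[:point] + parent_a[point:]
    return child_1, child_2

def inversion(string, point):
    """Move the tail that follows position `point` to the front of the same string."""
    return string[point:] + string[:point]

def mutate(bits, rng):
    """Flip one randomly chosen bit of a binary string (rng is a random.Random)."""
    i = rng.randrange(len(bits))
    flipped = "1" if bits[i] == "0" else "0"
    return bits[:i] + flipped + bits[i + 1:]
```

With the crossover point after the third position, crossover("xxxxxxxx", "yyyyyyyy", 3) gives the two offspring named in the text, and inversion("12345678", 6) gives "78123456", matching the x7x8x1...x6 pattern.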

The genetic paradigm is best understood through an example. To illustrate similarities between
the learning automation paradigm and the genetic paradigm we use the same learning task of the
previous section, namely learning to play the game of Nim optimally. We use a slightly different
representation scheme here since we want a population of structures that are easily transformed.
To do this, we let each member of the population consist of a pair of triples augmented with a
utility value u, ((n1, n2, n3) (m1, m2, m3) u), where the first triple is the game state presented to the
genetic algorithm system prior to its move, and the second triple is the state after the move. The u
values represent the worth or current utility of the structure at any given time.

Before the game begins, the genetic system randomly generates an initial population of K triple-
pair members. The population size K is one of the important parameters that must be selected.
Here, we simply assume it is about 25 or 30, which should be more than the number of moves
needed for any optimal play. All members are assigned an initial utility value of 0. The learning
process then proceeds as follows:

1. The environment presents a valid triple to the genetic system.

2. The genetic system searches for all population triple pairs that have a first triple that matches
the input triple. From those that match, the first one found having the highest utility value u is
selected and the second triple is returned as the new game state. If no match is found, the
genetic system randomly generates a triple which represents a valid move, returns this as
the new state, and stores the new triple pair (the input triple and the newly generated triple)
as an addition to the population.

3. The above two steps are repeated until the game is terminated, in which case the genetic
system is informed whether a win or loss occurred. If the system wins, each of the
participating member moves has its utility value increased. If the system loses, each
participating member has its utility value decreased.

4. The above steps are repeated until a fixed number of games have been played. At this time
a new generation is created.
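
Steps 2 and 3 above might be sketched as follows. The dictionary layout of a population member, the function names, and the fixed utility step of 1 are assumptions made for illustration; the text does not specify how much a utility value changes after a win or a loss.

```python
import random

def random_valid_move(state, rng):
    """Remove at least one object from one nonempty pile (a legal Nim move)."""
    pile = rng.choice([i for i, n in enumerate(state) if n > 0])
    new_state = list(state)
    new_state[pile] -= rng.randint(1, state[pile])
    return tuple(new_state)

def select_move(population, state, rng):
    """Step 2: pick the matching member with the highest utility, or create one.

    Each member is a dict {"before": triple, "after": triple, "u": utility}.
    """
    matches = [m for m in population if m["before"] == state]
    if matches:
        # max() keeps the first member found among those with the highest u.
        return max(matches, key=lambda m: m["u"])
    member = {"before": state, "after": random_valid_move(state, rng), "u": 0}
    population.append(member)
    return member

def update_utilities(participants, won, step=1):
    """Step 3: reward every participating member after a win, penalize after a loss."""
    for m in participants:
        m["u"] += step if won else -step
```

A game loop would call select_move once per turn, remember the members it used, and pass them to update_utilities when the game ends.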

The new generation is created from the old population by first selecting a fraction (say one half) of
the members having the highest utility values. From these, offspring are obtained by application
of appropriate genetic operators.

The three operators, crossover, inversion, and mutation, randomly modify the parent moves to
give new offspring move sequences. (The best choice of genetic operators to apply in this
example is left as an exercise). Each offspring inherits a utility value from one of the parents.
Population members having low utility values are discarded to keep the population size fixed.

This whole process is repeated until the genetic system has learned all the optimal moves. This
can be determined when the parent population ceases to change or when the genetic system
repeatedly wins.

The similarity between the learning automaton and genetic paradigms should be apparent from
this example. Both rely on random move sequences, and the better moves are rewarded while
the poorer ones are penalized.

Neural Network

Neural networks are large networks of simple processing elements or nodes that process
information dynamically in response to external inputs. The nodes are simplified models of
neurons. The knowledge in a neural network is distributed throughout the network in the form of
internode connections and weighted links that form the inputs to the nodes. The link weights
serve to enhance or inhibit the input stimuli values that are then added together at the nodes. If
the sum of all the inputs to a node exceeds some threshold value T, the node executes and
produces an output that is passed on to other nodes or is used to produce some output response.
In the simplest case, no output is produced if the total input is less than T. In more complex
models, the output will depend on a nonlinear activation function.

Neural networks were originally inspired by, and intended as models of, the human nervous system. They are
greatly simplified models to be sure (neurons are known to be fairly complex processors). Even
so, they have been shown to exhibit many "intelligent" abilities, such as learning, generalization,
and abstraction.

A single node is illustrated in Figure 7.13. The inputs to the node are the values x1, x2, ..., xn,
which typically take on values of -1, 0, 1, or real values within the range (-1, 1). The weights w1,
w2, ..., wn correspond to the synaptic strengths of a neuron. They serve to increase or decrease
the effects of the corresponding xi input values. The sum of the products xi × wi, i = 1, 2, ..., n,
serves as the total combined input to the node. If this sum is large enough to exceed the threshold
amount T, the node fires and produces an output y, an activation function value placed on the
node’s output links. The output may then be the input to other nodes or the final output response
from the network.
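
The behavior of a single node can be sketched as below. This is a minimal illustration of the simplest case only: the function name is hypothetical, and the output is assumed to be 1 when the node fires and 0 otherwise, rather than a general activation function value.

```python
def node_output(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold T, else 0."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0
```

For instance, with inputs (1, -1, 0), weights (0.5, -0.5, 2.0) and T = 0.5, the weighted sum is 1.0, so the node fires.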

Figure 7.13: Model of a single neuron (node)

Figure 7.14 illustrates three layers of a number of interconnected nodes. The first layer serves as
the input layer, receiving inputs from some set of stimuli. The second layer (called the hidden
layer) receives inputs from the first layer and produces a pattern of inputs to the third layer,
the output layer. The pattern of outputs from the final layer is the network's response to the
input stimuli patterns. Input links to layer j (j = 1, 2, 3) have weights wij for i = 1, 2, ..., n.

General multilayer networks having n nodes (number of rows) in each of m layers (number of
columns of nodes) will have weights represented as an n x m matrix W. Using this representation,
nodes having no interconnecting links will have a weight value of zero. Networks consisting of
more than three layers would, of course, be correspondingly more complex than the network
depicted in Figure 7.14.

A neural network can be thought of as a black box that transforms the input vector x to the output
vector y where the transformation performed is the result of the pattern of connections and
weights, that is, according to the values of the weight matrix W.

Consider the vector (inner) product

$$x \cdot w = \sum_{i} x_i w_i$$

There is a geometric interpretation for this product. It is equivalent to projecting one vector onto
the other vector in n-dimensional space. This notion is depicted in Figure 7.15 for the two-
dimensional case. The magnitude of the resultant vector is given by

$$x \cdot w = |x| \, |w| \cos\theta$$

where |x| denotes the norm or length of the vector x. Note that this product is maximum when both
vectors point in the same direction, that is, when θ = 0. The product is a minimum when both
point in opposite directions, that is, when θ = 180 degrees. This illustrates how the vectors in the
weight matrix W influence the inputs to the nodes in a neural network.
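
The geometric interpretation can be checked numerically with a short sketch. The helper names below are hypothetical, and the example vectors are chosen only because their lengths are whole numbers.

```python
import math

def dot(x, w):
    """Sum of the products of corresponding components."""
    return sum(xi * wi for xi, wi in zip(x, w))

def norm(v):
    """Length of a vector."""
    return math.sqrt(dot(v, v))

def cos_angle(x, w):
    """cos(theta) recovered from x . w = |x| |w| cos(theta)."""
    return dot(x, w) / (norm(x) * norm(w))
```

For x = (3, 4), the product with w = (6, 8) (same direction) is 50, with w = (-6, -8) (opposite direction) it is -50, and with the perpendicular vector w = (-4, 3) it is 0.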

Figure 7.14: A Multilayer Neural Network

Figure 7.15: Vector Multiplication is Like Vector Projection

Learning pattern weights

The interconnections and weights W in the neural network store the knowledge possessed by the
network. These weights must be preset or learned in some manner. When learning is used, the
process may be either supervised or unsupervised. In the supervised case, learning is performed
by repeatedly presenting the network with an input pattern and a desired output response. The
training examples then consist of the vector pairs (x, y'), where x is the input pattern and y' is the
desired output response pattern. The weights are then adjusted until the actual output response y
and the desired response y' are the same, that is, until the difference D = y' - y is near
zero.

One of the simpler supervised learning algorithms uses the following formula to adjust the weights
W.

$$W_{new} = W_{old} + a \times D \times \frac{x}{|x|^2}$$

Where 0 < a < 1 is a learning constant that determines the rate of learning. When the difference D
is large, the adjustment to the weights W is large, but when the output response y is close to the
target response y' the adjustment will be small. When the difference D is near zero, the training
process terminates at which point the network will produce the correct response for the given
input patterns x.
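
A minimal sketch of this update for a single linear node follows. It assumes the output is the plain weighted sum y = W · x and takes D as the desired response minus the actual response, so that each update moves y toward y'; the function name and the default a = 0.5 are illustrative choices, not taken from the text.

```python
def train_step(weights, x, desired, a=0.5):
    """Apply W_new = W_old + a * D * x / |x|^2 once for one training pair."""
    y = sum(w * xi for w, xi in zip(weights, x))   # actual response
    d = desired - y                                 # difference D (assumed sign)
    norm_sq = sum(xi * xi for xi in x)              # |x|^2
    return [w + a * d * xi / norm_sq for w, xi in zip(weights, x)]
```

Under these assumptions each step shrinks the difference for the presented pattern by the factor (1 - a), so a large difference produces a large adjustment and a small difference produces a small one.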

In unsupervised learning, the training examples consist of the input vectors x only. No desired
response y' is available to guide the system. Instead, the learning process must find the weights
W with no knowledge of the desired output response.


A neural net expert system

An example of a simple neural network diagnostic expert system has been described by Stephen
Gallant (1988). This system diagnoses and recommends treatments for acute sarcophagal
disease. The system is illustrated in Figure 7.16. From six symptom variables u1, u2, ..., u6, one
of two possible diseases can be diagnosed, u7 or u8. From the resultant diagnosis, one of three
treatments, u9, u10, or u11, can then be recommended.

Figure 7.16: A Simple Neural Network Expert System.

When a given symptom is present, the corresponding variable is given a value of +1 (true).
Negative symptoms are given an input value of -1 (false), and unknown symptoms are given the
value 0. Input symptom values are multiplied by their corresponding weights wij. Numbers within
the nodes are initial bias weights wi0, and numbers on the links are the other node input weights.
When the sum of the weighted products of the inputs exceeds 0, an output will be present on the
corresponding node output and serve as an input to the next layer of nodes.


As an example, suppose the patient has swollen feet (u1 = +1) but not red ears (u2 = -1) nor hair
loss (u3 = -1). This gives a value of u7 = +1 (since 0 + (2)(1) + (-2)(-1) + (3)(-1) = 1), suggesting the
patient has superciliosis.

When it is also known that the other symptoms of the patient are false (u5 = u6 = -1), it may be
concluded that namatosis is absent (u8 = -1), and therefore that birambio (u10 = +1) should be
prescribed while placibin should not be prescribed (u9 = -1). In addition, it will be found that
posiboost should also be prescribed (u11 = +1).

The training algorithm added the intermediate triangular shaped nodes. These additional nodes
are needed so that weight assignments can be made which permit the computations to work
correctly for all training instances.

Deductions can be made just as well when only partial information is available. For example,
when a patient has swollen feet and suffers from hair loss, it may be concluded the patient has
superciliosis, regardless of whether or not the patient has red ears. This is so because the
unknown variable cannot force the sum to change to negative.
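
The superciliosis computation and the partial-information argument can be reproduced with a small sketch. Only the weights of the u7 node are recoverable from the worked arithmetic in the text (bias 0, swollen feet 2, red ears -2, hair loss 3); the function names, the use of None for an unknown symptom, and the choice of -1 as the output when the sum does not exceed 0 are assumptions made here.

```python
from itertools import product

# Weights of the superciliosis node u7, read off the arithmetic in the text.
U7_BIAS = 0
U7_WEIGHTS = (2, -2, 3)   # swollen feet (u1), red ears (u2), hair loss (u3)

def weighted_sum(inputs):
    """Bias plus the sum of weight * input; unknown inputs are passed as 0."""
    return U7_BIAS + sum(w * u for w, u in zip(U7_WEIGHTS, inputs))

def node_value(inputs):
    """+1 when the weighted sum exceeds 0, otherwise -1 (assumed convention)."""
    return 1 if weighted_sum(inputs) > 0 else -1

def conclusion_is_firm(known):
    """True if every way of filling the unknown (None) inputs gives the same value."""
    unknown = [i for i, u in enumerate(known) if u is None]
    values = set()
    for filling in product((-1, 1), repeat=len(unknown)):
        trial = list(known)
        for i, u in zip(unknown, filling):
            trial[i] = u
        values.add(node_value(trial))
    return len(values) == 1
```

Here weighted_sum((1, -1, -1)) is 1, as in the example, and conclusion_is_firm((1, None, 1)) holds because the sum is 3 or 7 whichever value red ears takes.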

A system such as this can also explain how or why a conclusion was reached. For example,
when inputs and outputs are regarded as rules, an output can be explained as the conclusion to a
rule. If placibin is true, the system might explain why with a statement such as:

Placibin is TRUE due to the following rule:

IF Placibin Allergy (u6) is FALSE, and Superciliosis is TRUE

THEN Conclude Placibin is TRUE.

Expert systems based on neural network architectures can be designed to possess many of the
features of other expert system types, including explanations for how and why, and confidence
estimation for variable deduction.


Learning in Neural Networks

The perceptron, an invention of Rosenblatt [1962], was one of the earliest neural network models.
A perceptron models a neuron by taking a weighted sum of its inputs and sending the output 1 if
the sum is greater than some adjustable threshold value (otherwise it sends 0). Figure 7.17
shows the device. Notice that in a perceptron, unlike a Hopfield network, connections are
unidirectional.

The inputs (x1, x2, ..., xn) and connection weights (w1, w2, ..., wn) in the figure are typically real values,
both positive and negative. If the presence of some feature xi tends to cause the perceptron to
fire, the weight wi will be positive; if the feature xi inhibits the perceptron, the weight wi will be
negative. The perceptron itself consists of the weights, the summation processor, and the
adjustable threshold processor. Learning is a process of modifying the values of the weights and
the threshold. It is convenient to implement the threshold as just another weight w0, as in Figure
7.18. This weight can be thought of as the propensity of the perceptron to fire irrespective of its
inputs. The perceptron of Figure 7.18 fires if the weighted sum is greater than zero.

A perceptron computes a binary function of its input. Several perceptrons can be combined to
compute more complex functions, as shown in Figure 7.19.

Figure 7.17: A Neuron and a Perceptron

Figure 7.18: Perceptron with Adjustable Threshold Implemented as Additional Weight


Figure 7.19: A Perceptron with Many Inputs and Many Outputs

Such a group of perceptrons can be trained on sample input-output pairs until it learns to compute
the correct function. The amazing property of perceptron learning is this: Whatever a perceptron
can compute, it can learn to compute. We demonstrate this in a moment. At the time perceptrons
were invented, many people speculated that intelligent systems could be constructed out of
perceptrons (see Figure 7.20).

Since the perceptrons of Figure 7.19 are independent of one another, they can be separately
trained. So let us concentrate on what a single perceptron can learn to do. Consider the pattern
classification problem shown in Figure 7.21. This problem is linearly separable, because we can
draw a line that separates one class from another. Given values for x1 and x2, we want to train a
perceptron to output 1 if it thinks the input belongs to the class of white dots and 0 if it thinks the
input belongs to the class of black dots. We have no explicit rule to guide us; we must induce a
rule from a set of training instances. We now see how perceptrons can learn to solve such
problems.

First, it is necessary to take a close look at what the perceptron computes. Let x be an input vector
(x1, x2, ..., xn). Notice that the weighted summation function g(x) and the output function o(x) can
be defined as:

$$g(x) = \sum_{i=0}^{n} w_i x_i$$

$$o(x) = \begin{cases} 1 & \text{if } g(x) > 0 \\ 0 & \text{if } g(x) < 0 \end{cases}$$

Consider the case where we have only two inputs (as in Figure 7.22). Then:

$$g(x) = w_0 + w_1 x_1 + w_2 x_2$$

Figure 7.20: An Early Notion of an Intelligent System Built from Trainable Perceptrons

If g(x) is exactly zero, the perceptron cannot decide whether to fire. A slight change in inputs
could cause the device to go either way. If we solve the equation g(x) = 0, we get the equation for
a line:

$$x_2 = -\frac{w_1}{w_2} x_1 - \frac{w_0}{w_2}$$

The location of the line is completely determined by the weights w0, w1, and w2. If an input vector
lies on one side of the line, the perceptron will output 1; if it lies on the other side, the perceptron
will output 0. A line that correctly separates the training instances corresponds to a perfectly
functioning perceptron. Such a line is called a decision surface. In perceptrons with many inputs,
the decision surface will be a hyperplane through the multidimensional space of possible input
vectors. The problem of learning is one of locating an appropriate decision surface.
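
The two-input case can be made concrete with a short sketch. The weights (w0, w1, w2) = (-1, 1, 1) used in the note below are an arbitrary illustration, and the function names are hypothetical; a sum of exactly zero is treated here as "does not fire".

```python
def g(w, x1, x2):
    """Weighted sum g(x) = w0 + w1*x1 + w2*x2 for a two-input perceptron."""
    return w[0] + w[1] * x1 + w[2] * x2

def output(w, x1, x2):
    """1 if the perceptron fires (g(x) > 0), otherwise 0."""
    return 1 if g(w, x1, x2) > 0 else 0

def boundary_x2(w, x1):
    """x2 on the decision line g(x) = 0 for a given x1 (requires w2 != 0)."""
    return -(w[1] / w[2]) * x1 - w[0] / w[2]
```

With w = (-1, 1, 1) the decision line is x2 = -x1 + 1: the point (1, 1) lies on the firing side, (0, 0) on the other side, and (0.25, 0.75) lies exactly on the line.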

We present a formal learning algorithm later. For now, consider the informal rule:

If the perceptron fires when it should not fire, make each wi smaller by an amount
proportional to xi. If the perceptron fails to fire when it should fire, make each wi larger by a
similar amount.

Suppose we want to train a three-input perceptron to fire only when its first input is on. If the
perceptron fails to fire in the presence of an active x1, we will increase w1 (and we may increase
other weights). If the perceptron fires incorrectly, we will end up decreasing weights that are not
w1. (We will never decrease w1 because undesired firings only occur when x1 is 0, which forces
the proportional change in w1 also to be 0.) In addition, w0 will find a value based on the total
number of incorrect firings versus incorrect misfirings. Soon, w1 will become large enough to
overpower w0, while w2 and w3 will not be powerful enough to fire the perceptron, even in the
presence of both x2 and x3.

Figure 7.21: A Linearly Separable Pattern Classification Problem

Now let us return to the functions g(x) and o(x). While the sign of g(x) is critical to determining
whether the perceptron will fire, the magnitude is also important. The absolute value of g(x) tells
how far a given input vector x lies from the decision surface. This gives us a way of
characterizing how good a set of weights is. Let w be the weight vector (w0, w1, ..., wn), and let X
be the subset of training instances misclassified by the current set of weights. Then define the
perceptron criterion function, J(w), to be the sum of the distances of the misclassified input
vectors from the decision surface:

$$J(\vec{w}) = \sum_{\vec{x} \in X} \left| \sum_{i=0}^{n} w_i x_i \right| = \sum_{\vec{x} \in X} \left| \vec{w} \cdot \vec{x} \right|$$

To create a better set of weights than the current set, we would like to reduce J(w). Ultimately, if
all inputs are classified correctly, J(w) = 0.

How do we go about minimizing J(w)? We can use a form of local-search hill climbing known as
gradient descent. For our current purposes, think of J(w) as defining a surface in the space of
all possible weights. Such a surface might look like the one in Figure 7.22.

In the figure, weight w0 should be part of the weight space but is omitted here because it is easier
to visualize J in only three dimensions. Now, some of the weight vectors constitute solutions, in
that a perceptron with such a weight vector will classify all its inputs correctly. Note that there are
an infinite number of solution vectors. For any solution vector ws, we know that J(ws) = 0.

Suppose we begin with a random weight vector w that is not a solution vector. We want to slide
down the J surface. There is a mathematical method for doing this: we compute the gradient of
the function J(w). Before we derive the gradient function, we reformulate the perceptron
criterion function to remove the absolute value sign:

$$J(\vec{w}) = \sum_{\vec{x} \in X} \vec{w} \cdot \begin{cases} \vec{x} & \text{if } \vec{x} \text{ is misclassified as a negative example} \\ -\vec{x} & \text{if } \vec{x} \text{ is misclassified as a positive example} \end{cases}$$

Figure 7.22: Adjusting the Weights by Gradient Descent, Minimizing J(w)

Recall that X is the set of misclassified input vectors.

Now, here is ∇J, the gradient of J(w) with respect to the weight space:

$$\nabla J(\vec{w}) = \sum_{\vec{x} \in X} \begin{cases} \vec{x} & \text{if } \vec{x} \text{ is misclassified as a negative example} \\ -\vec{x} & \text{if } \vec{x} \text{ is misclassified as a positive example} \end{cases}$$

The gradient is a vector that tells us the direction to move in the weight space in order to reduce
J(w). In order to find a solution weight vector, we simply change the weights in the direction of
the gradient, recompute J(w), recompute the new gradient, and iterate until J(w) = 0. The
rule for updating the weights at time t + 1 is:

$$\vec{w}_{t+1} = \vec{w}_t + \eta \nabla J$$

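Or in expanded form:

$$\vec{w}_{t+1} = \vec{w}_t + \eta \sum_{\vec{x} \in X} \begin{cases} \vec{x} & \text{if } \vec{x} \text{ is misclassified as a negative example} \\ -\vec{x} & \text{if } \vec{x} \text{ is misclassified as a positive example} \end{cases}$$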

η is a scale factor that tells us how far to move in the direction of the gradient. A small η will
lead to slower learning, but a large η may cause a move through weight space that overshoots the
solution vector. Taking η to be a constant gives us what is usually called the “fixed-
increment perceptron-learning algorithm”:

1. Create a perceptron with n+1 inputs and n+1 weights, where the extra input x0 is always set
to 1.

2. Initialize the weights (w0, w1,……wn) to random real values.

3. Iterate through the training set, collecting all examples misclassified by the current set of
weights.

4. If all examples are classified correctly, output the weights and quit.

5. Otherwise, compute the vector sum S of the misclassified input vectors, where each vector
has the form (x0, x1, ..., xn). In creating the sum, add to S a vector x if x is an input for
which the perceptron incorrectly fails to fire, but add vector -x if x is an input for which the
perceptron incorrectly fires. Multiply the sum by a scale factor η.

6. Modify the weights (w0, w1…..wn) by adding the elements of the vector S to them. Go to step
3.
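
The six steps above can be sketched as follows. The function name, the range used for the random initial weights, the seed, and the cap on the number of passes are all assumptions added for illustration; the cap is there only so the sketch terminates on data that is not linearly separable.

```python
import random

def train_perceptron(examples, eta=1.0, max_passes=1000, seed=0):
    """Fixed-increment perceptron learning.

    examples: list of (inputs, label) pairs, label 1 (should fire) or 0.
    Returns the weights (w0, ..., wn), or None if max_passes is exhausted.
    """
    rng = random.Random(seed)
    n = len(examples[0][0])
    # Steps 1-2: n+1 weights with random initial values; w0 pairs with x0 = 1.
    w = [rng.uniform(-1.0, 1.0) for _ in range(n + 1)]
    for _ in range(max_passes):
        s = [0.0] * (n + 1)          # vector sum S of misclassified inputs
        misclassified = False
        # Step 3: pass through the training set, collecting the errors.
        for inputs, label in examples:
            x = [1.0] + list(inputs)
            fired = sum(wi * xi for wi, xi in zip(w, x)) > 0
            if label == 1 and not fired:      # incorrectly fails to fire: add x
                s = [si + xi for si, xi in zip(s, x)]
                misclassified = True
            elif label == 0 and fired:        # incorrectly fires: add -x
                s = [si - xi for si, xi in zip(s, x)]
                misclassified = True
        # Step 4: stop when every example is classified correctly.
        if not misclassified:
            return w
        # Steps 5-6: scale S by eta and add it to the weights.
        w = [wi + eta * si for wi, si in zip(w, s)]
    return None
```

As a usage example, training on the logical AND of two inputs, [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)], which is linearly separable, returns a weight vector that fires only for (1, 1).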

The perceptron-learning algorithm is a search algorithm. It begins in a random initial state and
finds a solution state. The search space is simply all possible assignments of real values to the
weights of the perceptron, and the search strategy is gradient descent.

So far, we have seen two search methods employed by neural networks, gradient descent in
perceptrons and parallel relaxation in Hopfield networks. It is important to understand the relation
between the two. Parallel relaxation is a problem-solving strategy, analogous to state space
search in symbolic AI. Gradient descent is a learning strategy, analogous to techniques such as
version spaces. In both symbolic and connectionist AI, learning is viewed as a type of problem
solving, and this is why search is useful in learning. But the ultimate goal of learning is to get a
system into a position where it can solve problems better. Do not confuse learning algorithms with
others.

The perceptron convergence theorem, due to Rosenblatt [1962], guarantees that the perceptron
will find a solution state, i.e., it will learn to classify any linearly separable set of inputs. In other
words, the theorem shows that in the weight space, there are no local minima that do not
correspond to the global minimum. Figure 7.23 shows a perceptron learning to classify the
instances of Figure 7.21. Remember that every set of weights specifies some decision surface, in
this case some two-dimensional line. In the figure, k is the number of passes through the training
data, i.e., the number of iterations of steps 3 through 6 of the fixed-increment perceptron-learning
algorithm.

The introduction of perceptrons in the late 1950s created a great deal of excitement: here was a
device that strongly resembled a neuron and for which well-defined learning algorithms were
available. There was much speculation about how intelligent systems could be constructed from
perceptron building blocks. In their book Perceptrons, Minsky and Papert put an end to such
speculation by analyzing the computational capabilities of the devices. They noticed that while the
convergence theorem guaranteed correct classification of linearly separable data, most problems
do not supply such nice data. Indeed, the perceptron is incapable of learning to solve some very
simple problems.

Figure 7.23: A Perceptron Learning to Solve a Classification Problem
