Machine learning Lecture 2
1. Lecture No. 2
Ravi Gupta
AU-KBC Research Centre,
MIT Campus, Anna University
Date: 8.3.2008
2. Today’s Agenda
• Recap (FIND-S Algorithm)
• Version Space
• Candidate-Elimination Algorithm
• Decision Tree
• ID3 Algorithm
• Entropy
3. Concept Learning as Search
Concept learning can be viewed as the task of searching through
a large space of hypotheses implicitly defined by the hypothesis
representation.
The goal of the concept learning search is to find the hypothesis
that best fits the training examples.
4. General-to-Specific Learning
Example: learn to predict the days on which Tom enjoys his
favorite sport, i.e., only positive examples are considered.
Most General Hypothesis: h = <?, ?, ?, ?, ?, ?>
Most Specific Hypothesis: h = < Ø, Ø, Ø, Ø, Ø, Ø>
6. Definition
Given hypotheses hj and hk, hj is more_general_than_or_equal_to
hk if and only if any instance that satisfies hk also satisfies hj.
We can also say that hj is more_specific_than hk when hk is
more_general_than hj.
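In code, the relation can be sketched as follows (a minimal Python sketch for conjunctive hypotheses written as attribute tuples, where '?' accepts any value and None stands for the empty constraint Ø; this encoding is an illustrative assumption, not something the slides prescribe):

def more_general_or_equal(hj, hk):
    """True if every instance satisfying hk also satisfies hj."""
    if any(b is None for b in hk):   # hk matches no instance: vacuously true
        return True
    return all(a == '?' or a == b for a, b in zip(hj, hk))

h1 = ('Sunny', '?', '?', '?', '?', '?')
h2 = ('Sunny', 'Warm', '?', 'Strong', '?', '?')
print(more_general_or_equal(h1, h2))  # True: h1 is more general than h2
print(more_general_or_equal(h2, h1))  # False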
13. Unanswered Questions by FIND-S
• Has the learner converged to the correct target
concept?
• Why prefer the most specific hypothesis?
• What if the training examples are inconsistent?
14. Version Space
The set of all hypotheses consistent with the training examples is
called the version space (VS) with respect to the hypothesis space H
and the given example set D.
15. Candidate-Elimination Algorithm
The Candidate-Elimination algorithm finds all describable hypotheses
that are consistent with the observed training examples.
Hypotheses are refined using every training example x, regardless of
whether x is a positive or a negative example.
19. LIST-THEN-ELIMINATE Algorithm
to Obtain Version Space
• In principle, the LIST-THEN-ELIMINATE algorithm can be
applied whenever the hypothesis space H is finite.
• It is guaranteed to output all hypotheses consistent with the
training data.
• Unfortunately, it requires exhaustively enumerating all
hypotheses in H, an unrealistic requirement for all but the most
trivial hypothesis spaces.
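As a concrete (if toy-sized) illustration, here is a hedged Python sketch of LIST-THEN-ELIMINATE over a small invented two-attribute domain; the hypothesis encoding ('?' = any value) and the domain itself are our own assumptions, and the all-Ø hypothesis is omitted for brevity:

from itertools import product

# Toy domain invented for illustration.
domains = [('Sunny', 'Rainy'), ('Warm', 'Cold')]

def matches(h, x):
    """True if hypothesis h covers instance x ('?' accepts any value)."""
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

# H: every conjunction of '?' or one concrete value per attribute.
H = list(product(*[('?',) + d for d in domains]))

def list_then_eliminate(H, examples):
    """Return the version space: all hypotheses consistent with every example."""
    return [h for h in H
            if all(matches(h, x) == label for x, label in examples)]

examples = [(('Sunny', 'Warm'), True), (('Rainy', 'Cold'), False)]
print(list_then_eliminate(H, examples))
# [('?', 'Warm'), ('Sunny', '?'), ('Sunny', 'Warm')]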
20. Candidate-Elimination Algorithm
• The CANDIDATE-ELIMINATION algorithm works on the same
principle as the above LIST-THEN-ELIMINATE algorithm.
• It employs a much more compact representation of the version
space.
• Here the version space is represented by its most general and its
most specific (least general) members.
• These members form general and specific boundary sets that delimit
the version space within the partially ordered hypothesis space.
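A simplified Python sketch of these boundary-set updates follows, again for conjunctive hypotheses ('?' = any value, None = Ø). It is an outline under our own encoding assumptions and omits some bookkeeping of the full algorithm (e.g., pruning hypotheses that fall outside the opposite boundary), so treat it as a sketch rather than a reference implementation:

def matches(h, x):
    """True if hypothesis h covers instance x."""
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def generalize(s, x):
    """Minimally generalize s so that it covers positive example x."""
    return tuple(xv if sv is None else (sv if sv == xv else '?')
                 for sv, xv in zip(s, x))

def specialize(g, x, domains):
    """Minimal specializations of g that exclude negative example x."""
    return [g[:i] + (v,) + g[i + 1:]
            for i, gv in enumerate(g) if gv == '?'
            for v in domains[i] if v != x[i]]

def candidate_elimination(examples, domains):
    n = len(domains)
    S = [tuple([None] * n)]    # most specific boundary
    G = [tuple(['?'] * n)]     # most general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if matches(g, x)]      # drop inconsistent g
            S = [generalize(s, x) for s in S]        # minimally generalize S
        else:
            S = [s for s in S if not matches(s, x)]  # drop inconsistent s
            G = [h for g in G                        # minimally specialize G
                 for h in ([g] if not matches(g, x)
                           else specialize(g, x, domains))]
    return S, G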
29. Remarks on Version Spaces and
Candidate-Elimination
The version space learned by the CANDIDATE-ELIMINATION algorithm
will converge toward the hypothesis that correctly describes the target
concept, provided
(1) there are no errors in the training examples, and
(2) there is some hypothesis in H that correctly
describes the target concept.
35. Remarks on Version Spaces and
Candidate-Elimination
The target concept is exactly learned when
the S and G boundary sets converge to a
single, identical hypothesis.
36. Remarks on Version Spaces and
Candidate-Elimination
How Can Partially Learned Concepts Be Used?
Suppose that no additional training examples are available beyond
the four in our example, and that the learner is now required to
classify new instances that it has not yet observed.
38. Remarks on Version Spaces and
Candidate-Elimination
[Figure: new instances on which all six version-space hypotheses agree,
so each can be classified with full confidence.]
39. Remarks on Version Spaces and
Candidate-Elimination
[Figure: one new instance is satisfied by three hypotheses and not by
the other three (an even split); another is satisfied by two and not by
four (a majority vote for the negative classification).]
42. Decision Trees
• Decision tree learning is a method for approximating
discrete-valued target functions, in which the learned function
is represented by a decision tree.
• Decision trees can also be represented by if-then-else rules.
• Decision tree learning is one of the most widely used
approaches for inductive inference.
43. Decision Trees
An instance is classified by starting at the root node of the tree, testing the
attribute specified by this node, then moving down the tree branch
corresponding to the value of the attribute in the given example. This process
is then repeated for the subtree rooted at the new node.
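A small Python sketch of this procedure follows. Trees are encoded here as nested dicts, where an internal node is {"attribute": ..., "branches": {...}} and a leaf is just an output value; this encoding, and the example tree (matching the PlayTennis rules on a later slide), are illustrative choices of our own:

def classify(tree, instance):
    """Walk from the root to a leaf, following the instance's attribute values."""
    while isinstance(tree, dict):        # leaves are bare output values
        tree = tree["branches"][instance[tree["attribute"]]]
    return tree

play_tennis = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {"attribute": "Humidity",
                  "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"attribute": "Wind",
                 "branches": {"Weak": "Yes", "Strong": "No"}},
    },
}

print(classify(play_tennis, {"Outlook": "Rain", "Wind": "Weak"}))  # Yes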
45. Decision Trees
[Figure: a generic decision tree. The root node tests attribute A1;
edges carry attribute values; intermediate nodes test attributes
(A2, A3); leaf nodes hold output values.]
46. Decision Trees
Decision trees represent a disjunction of conjunctions of constraints
on the attribute values of instances.
Each path from the tree root to a leaf corresponds to a conjunction of
attribute tests, and the tree itself to a disjunction of these
conjunctions.
48. Decision Trees (F = A ^ B')
F = A ^ B'
If-then-else form:
If (A = True and B = False) then Yes
else No

A
├─ False → No
└─ True → B
    ├─ False → Yes
    └─ True → No
49. Decision Trees (F = A V (B ^ C))
If-then-else form:
If (A = True) then Yes
else if (B = True and C = True) then Yes
else No

A
├─ True → Yes
└─ False → B
    ├─ False → No
    └─ True → C
        ├─ False → No
        └─ True → Yes
50. Decision Trees (F = A XOR B)
F = (A ^ B') V (A' ^ B)
If-then-else form:
If (A = True and B = False) then Yes
else if (A = False and B = True) then Yes
else No

A
├─ False → B
│   ├─ False → No
│   └─ True → Yes
└─ True → B
    ├─ False → Yes
    └─ True → No
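In the same nested-dict encoding used earlier (our own illustrative convention), the XOR tree can be written and exercised as follows; note that XOR must test B under both branches of A, since no single conjunction of tests captures it:

def classify(tree, instance):
    while isinstance(tree, dict):
        tree = tree["branches"][instance[tree["attribute"]]]
    return tree

xor_tree = {
    "attribute": "A",
    "branches": {
        False: {"attribute": "B", "branches": {False: "No", True: "Yes"}},
        True: {"attribute": "B", "branches": {False: "Yes", True: "No"}},
    },
}

for a in (False, True):
    for b in (False, True):
        print(a, b, classify(xor_tree, {"A": a, "B": b}))
# False False No / False True Yes / True False Yes / True True No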
51. Decision Trees as If-then-else rule
Each rule below is a conjunction of attribute tests; the rule set as a
whole is their disjunction:
If (Outlook = Sunny AND Humidity = Normal) then PlayTennis = Yes
If (Outlook = Overcast) then PlayTennis = Yes
If (Outlook = Rain AND Wind = Weak) then PlayTennis = Yes
52. Problems Suitable for Decision Trees
• Instances are represented by attribute-value pairs
Instances are described by a fixed set of attributes (e.g., Temperature) and
their values (e.g., Hot). The easiest situation for decision tree learning is when
each attribute takes on a small number of disjoint possible values (e.g., Hot,
Mild, Cold). However, extensions to the basic algorithm allow handling real-
valued attributes as well (e.g., representing Temperature numerically).
• The target function has discrete output values
• Disjunctive descriptions may be required
• The training data may contain errors
• The training data may contain missing attribute values
53. Basic Decision Tree Learning Algorithm
• ID3 algorithm (Quinlan 1986) and its successors C4.5 and C5.0
(http://www.rulequest.com/Personal/).
• Employs a top-down, greedy search through the space of possible
decision trees. The algorithm never backtracks to reconsider
earlier choices.
58. Building Decision Tree
[Figure: growing a decision tree. The root tests attribute A1; one
branch ends in an output value, while the others lead to new decision
nodes testing attributes A2 and A3, whose branches end in output
values.]
59. Building Decision Tree
[Figure: four candidate attributes for the root node: Outlook,
Temperature, Humidity, and Wind. Which attribute to select?]
60. Which Attribute to Select ??
• We would like to select the attribute that is most useful for
classifying examples.
• What is a good quantitative measure of the worth of an
attribute?
ID3 uses a measure called information gain to select among the
candidate attributes at each step while growing the tree.
61. Information Gain
Information gain is based on an information theory concept called entropy.

“Nothing in life is certain except death, taxes and the second law of
thermodynamics. All three are processes in which useful or accessible
forms of some quantity, such as energy or money, are transformed into
useless, inaccessible forms of the same quantity. That is not to say
that these three processes don’t have fringe benefits: taxes pay for
roads and schools; the second law of thermodynamics drives cars,
computers and metabolism; and death, at the very least, opens up
tenured faculty positions.”
Seth Lloyd, writing in Nature 430, 971 (26 August 2004).

Rudolf Julius Emanuel Clausius (January 2, 1822 – August 24, 1888) was
a German physicist and mathematician and is considered one of the
central founders of the science of thermodynamics.

Claude Elwood Shannon (April 30, 1916 – February 24, 2001), an
American electrical engineer and mathematician, has been called "the
father of information theory".
62. Entropy
• In information theory, the Shannon entropy or
information entropy is a measure of the uncertainty
associated with a random variable.
• It quantifies the information contained in a
message, usually in bits or bits/symbol.
• It is the minimum message length necessary to
communicate information.
63. Why Shannon named his uncertainty function "entropy"?
My greatest concern was what to call it. I thought of calling it 'information,' but the
word was overly used, so I decided to call it 'uncertainty.' When I discussed it with
John von Neumann, he had a better idea. Von Neumann told me, 'You should call
it entropy, for two reasons. In the first place your uncertainty function has
been used in statistical mechanics under that name, so it already has a name.
In the second place, and more important, no one really knows what entropy
really is, so in a debate you will always have the advantage.'
64. Shannon's mouse
Shannon and his famous electromechanical mouse Theseus, named after
the Greek hero of Minotaur-and-Labyrinth fame, which he tried to teach
to find its way out of a maze in one of the first experiments in
artificial intelligence.
65. Entropy
The information entropy of a discrete random variable X that can take
on possible values {x_1, ..., x_n} is

H(X) = E[I(X)] = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)

where
I(X) is the information content or self-information of X, which is
itself a random variable; and
p(x_i) = Pr(X = x_i) is the probability mass function of X.
66. Entropy in our Context
Given a collection S containing positive and negative examples of some
target concept, the entropy of S relative to this boolean (yes/no)
classification is

Entropy(S) = -p_{\oplus} \log_2 p_{\oplus} - p_{\ominus} \log_2 p_{\ominus}

where p_{\oplus} is the proportion of positive examples in S and
p_{\ominus} is the proportion of negative examples in S. In all
calculations involving entropy we define 0 log 0 to be 0.
67. Example
There are 14 examples: 9 positive and 5 negative [9+, 5-].
The entropy of S relative to this boolean (yes/no) classification is

Entropy([9+, 5-]) = -(9/14) \log_2 (9/14) - (5/14) \log_2 (5/14) = 0.940
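This value is easy to verify in a few lines of Python (a minimal sketch, using the 0 log 0 = 0 convention from the previous slide):

import math

def entropy(pos, neg):
    """Entropy of a boolean collection with pos positive and neg negative examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                  # define 0 log 0 = 0
            h -= p * math.log2(p)
    return h

print(f"{entropy(9, 5):.3f}")      # 0.940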
68. Information Gain Measure
Information gain is simply the expected reduction in entropy caused
by partitioning the examples according to an attribute. More
precisely, the information gain Gain(S, A) of an attribute A, relative
to a collection of examples S, is defined as

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)

where Values(A) is the set of all possible values for attribute A, and
S_v is the subset of S for which attribute A has value v, i.e.,
S_v = \{ s \in S \mid A(s) = v \}.
69. Information Gain Measure
In the definition above, the first term is the entropy of S and the
second term is the expected entropy after S is partitioned using
attribute A.

Gain(S, A) is the expected reduction in entropy caused by knowing the
value of attribute A. Equivalently, Gain(S, A) is the information
provided about the target function value, given the value of some
other attribute A. The value of Gain(S, A) is the number of bits saved
when encoding the target value of an arbitrary member of S, by knowing
the value of attribute A.
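The definition translates directly into code. In the hedged sketch below, examples are dicts of attribute values with the class stored under a "label" key, a convention of our own:

import math
from collections import defaultdict

def entropy(examples):
    """Boolean entropy of a collection of labelled examples."""
    n = len(examples)
    h = 0.0
    for cls in (True, False):
        p = sum(1 for e in examples if e["label"] is cls) / n
        if p > 0:
            h -= p * math.log2(p)
    return h

def gain(examples, attribute):
    """Expected reduction in entropy from partitioning on attribute."""
    partitions = defaultdict(list)
    for e in examples:
        partitions[e[attribute]].append(e)   # build the subsets S_v
    remainder = sum(len(sv) / len(examples) * entropy(sv)
                    for sv in partitions.values())
    return entropy(examples) - remainder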
70. Example
Continuing with the 14 examples [9+, 5-] and Entropy(S) = 0.940,
consider the attribute Wind with Values(Wind) = {Weak, Strong}, where
S_Weak = [6+, 2-] and S_Strong = [3+, 3-]:

Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.000) = 0.048
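With entropy and information gain in hand, the ID3 growth procedure from the earlier slides (top-down, greedy, no backtracking) can be sketched compactly. The sketch below inlines compact variants of the helpers above and leaves out everything C4.5/C5.0 adds (real-valued attributes, missing values, pruning); the nested-dict tree encoding is again our own convention:

import math
from collections import Counter, defaultdict

def entropy(examples):
    n = len(examples)
    return -sum(c / n * math.log2(c / n)
                for c in Counter(e["label"] for e in examples).values())

def gain(examples, attribute):
    parts = defaultdict(list)
    for e in examples:
        parts[e[attribute]].append(e)
    return entropy(examples) - sum(len(p) / len(examples) * entropy(p)
                                   for p in parts.values())

def id3(examples, attributes):
    """Grow a tree top-down, greedily picking the highest-gain attribute."""
    labels = [e["label"] for e in examples]
    if len(set(labels)) == 1:                # pure node: emit a leaf
        return labels[0]
    if not attributes:                       # no tests left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a))
    parts = defaultdict(list)
    for e in examples:
        parts[e[best]].append(e)
    rest = [a for a in attributes if a != best]
    return {"attribute": best,               # never revisited: no backtracking
            "branches": {v: id3(subset, rest)
                         for v, subset in parts.items()}}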
85. Some Insights into Capabilities and
Limitations of ID3 Algorithm
• ID3 searches a complete hypothesis space. [Advantage]
• ID3 maintains only a single current hypothesis as it searches
through the space of decision trees. By committing to a single
hypothesis, ID3 loses the capabilities that follow from explicitly
representing all consistent hypotheses. [Disadvantage]
• ID3 in its pure form performs no backtracking in its search. Once it
selects an attribute to test at a particular level in the tree, it never
backtracks to reconsider this choice. Therefore, it is susceptible to
the usual risks of hill-climbing search without backtracking:
converging to locally optimal solutions that are not globally optimal.
[Disadvantage]
86. Some Insights into Capabilities and
Limitations of ID3 Algorithm
• ID3 uses all training examples at each step in the search to make
statistically based decisions regarding how to refine its current
hypothesis. This contrasts with methods that make decisions
incrementally, based on individual training examples (e.g., FIND-S
or CANDIDATE-ELIMINATION). One advantage of using statistical
properties of all the examples (e.g., information gain) is that the
resulting search is much less sensitive to errors in individual training
examples. [Advantage]