Introduction to Machine Learning

                                                            Jinhyuk Choi
Human-Computer Interaction Lab @ Information and Communications University
Contents
   Concepts of Machine Learning

   Multilayer Perceptrons

   Decision Trees

   Bayesian Networks
What is Machine Learning?
   Large storage / large amount of data

   Data looks random but contains certain patterns
       Web log data
       Medical record
       Network optimization
       Bioinformatics
       Machine vision
       Speech recognition…

   No complete identification of the process
       A good or useful approximation
What is Machine Learning?
Definition
   Programming computers to optimize a
    performance criterion using example data or past
    experience

   Role of Statistics
       Inference from a sample
   Role of Computer science
       Efficient algorithms to solve the optimization problem
       Representing and evaluating the model for inference
   Descriptive (training) / predictive (generalization)
                              Learning from Human-generated data??
What is Machine Learning?
Concept Learning

• Inducing general functions from specific training examples (positive or
  negative)
• Looking for the hypothesis that best fits the training examples

   Objects:  eyes, nose, legs, reproductive ability, wings, beak, feathers, …  (vs. inanimate objects, …)
   Concept:  Bird
   Boolean function:  Bird(animal) → “true or not”




• Concepts:
  - describe some subset of objects or events defined over a larger set
  - can be represented as a boolean-valued function
What is Machine Learning?
Concept Learning

   Inferring a boolean-valued function from training examples of its input and
    output

                               [Diagram: candidate Hypothesis 1 and Hypothesis 2 approximating
                                the target Concept, learned from positive and negative examples
                                drawn from domains such as web log data, medical records, network
                                optimization, bioinformatics, machine vision, speech recognition…]
What is Machine Learning?
Learning Problem Design

   Do you enjoy sports ?
     Learn to predict the value of “EnjoySports” for an arbitrary day, based on
      the value of its other attributes




   What problem?
     Why learning?
   Attribute selection
     Effective?
     Enough?
   What learning algorithm?
Applications
   Learning associations
   Classification
   Regression
   Unsupervised learning
   Reinforcement learning
Examples (1)

   TV program preference inference based on web usage data


      [Diagram: Web pages #1–#4 … (step 1)  →  Classifier (step 3)  →  TV programs #1–#4 … (step 2)]
     What are we supposed to do at each step?
Examples (2)
  from a HW of Neural Networks Class (KAIST-2002)

     Function approximation (Mexican hat)


                    f3(x1, x2) = sin( 2π √( x1² + x2² ) ),      x1, x2 ∈ [−1, 1]
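
   A minimal MATLAB sketch of how such a training set could be generated, assuming the reconstructed
   target above (the grid resolution n is an arbitrary choice, not from the slides):

       % Sample the "Mexican hat" surface on a grid to obtain training data
       % for a function-approximation network.
       n = 21;                                       % grid resolution (assumption)
       [x1, x2] = meshgrid(linspace(-1, 1, n));      % inputs in [-1, 1] x [-1, 1]
       f3 = sin(2*pi*sqrt(x1.^2 + x2.^2));           % target surface
       X  = [x1(:) x2(:)];                           % training inputs, one row per sample
       t  = f3(:);                                   % training targets
       mesh(x1, x2, f3);                             % visualize the surface to be learned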
Examples (3)
from a HW of Machine Learning Class (ICU-2006)

   Face image classification
Examples (4)
from a HW of Machine Learning Class (ICU-2006)
Examples (5)
from a HW of Machine Learning Class (ICU-2006)

   Sensay
Examples (6)




A. Krause et al., “Unsupervised, Dynamic Identification of Physiological and Activity Context in Wearable
Computing”, ISWC 2005
#1. Multilayer Perceptrons
Neural Network?




                  VS.   Adaline
                        MLP
                        SOM
                        Hopfield network
                         RBFN
                        Bifurcating neuron networks
                        …
Multilayer Networks of Sigmoid Units




                             • Supervised learning
                             • 2-layer
                             • Fully connected




                      Really looks like the brain??
Sigmoid Unit
The back-propagation algorithm
  Network model

       Input layer: xi    →    hidden layer: yj    →    output layer: ok
       (weights vji between input and hidden, weights wkj between hidden and output)

                     yj = s( Σi vji · xi )
                     ok = s( Σj wkj · yj )

   Error function:     E(v, w) = ½ Σk ( tk − ok )²

       Stochastic gradient descent
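
   A minimal MATLAB sketch of this forward pass and error function; the logistic sigmoid and the
   layer sizes, weights and target are illustrative assumptions, not from the slides:

       s = @(a) 1 ./ (1 + exp(-a));    % sigmoid unit s(.)
       x = rand(3, 1);                 % input vector x_i
       v = randn(4, 3);                % input-to-hidden weights v_ji
       w = randn(2, 4);                % hidden-to-output weights w_kj
       t = [1; 0];                     % target vector t_k
       y = s(v * x);                   % hidden activations y_j
       o = s(w * y);                   % network outputs o_k
       E = 0.5 * sum((t - o).^2);      % squared error E(v, w)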
Gradient-Descent Function Minimization
Gradient-descent function minimization
 In order to find a vector parameter x that minimizes a function f(x) …
     Start with a random initial value x = x(0).
     Determine the direction of the steepest descent in the parameter space by

             ∇f = ( ∂f/∂x1 , ∂f/∂x2 , … , ∂f/∂xn )

     Move a step in that direction:

             x(i+1) = x(i) − η ∇f

     Repeat the above two steps until there is no more change in x.


 For gradient-descent to work…
     The function to be minimized should be continuous.
     The function should not have too many local minima.
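
 A minimal gradient-descent sketch in MATLAB on a toy quadratic (the function, step size and
 iteration count are assumptions for illustration):

       f     = @(x) (x(1) - 1)^2 + 2*(x(2) + 3)^2;    % function to minimize
       gradf = @(x) [2*(x(1) - 1); 4*(x(2) + 3)];     % its gradient
       x     = randn(2, 1);                           % random initial value x(0)
       eta   = 0.1;                                   % learning-rate step
       for i = 1:200
           x = x - eta * gradf(x);                    % step against the gradient
       end
       fprintf('minimum near (%.3f, %.3f), f = %.4f\n', x(1), x(2), f(x));   % ~ (1, -3)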
Back-propagation
Derivation of back-propagation algorithm

Adjustment of wkj :

    ∂E/∂wkj = ∂/∂wkj [ ½ Σk ( tk − ok )² ]
            = ½ ∂/∂wkj [ tk − s( Σj wkj yj ) ]²
            = ½ · 2 ( tk − ok ) · ( − ok (1 − ok) ) · yj
            = − yj ok (1 − ok) ( tk − ok )

    Δwkj = − η ∂E/∂wkj = η ok (1 − ok) ( tk − ok ) yj = η δko yj

    where δko = ok (1 − ok) ( tk − ok ) is the output-layer error term.
Derivation of back-propagation algorithm
   Adjustment of vji :

    ∂E/∂vji = ∂/∂vji [ ½ Σk ( tk − ok )² ]
            = ½ Σk ∂/∂vji [ tk − s( Σj wkj s( Σi vji xi ) ) ]²
            = ½ Σk 2 ( tk − ok ) · ( − ok (1 − ok) ) · wkj · yj (1 − yj) · xi
            = − xi yj (1 − yj) Σk wkj ok (1 − ok) ( tk − ok )

    Δvji = − η ∂E/∂vji = η yj (1 − yj) [ Σk wkj ok (1 − ok) ( tk − ok ) ] xi

         = η yj (1 − yj) ( Σk wkj δko ) xi = η δjy xi

    where δjy = yj (1 − yj) Σk wkj δko is the hidden-layer error term.
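
   A minimal MATLAB sketch of one stochastic update using the two delta rules derived above
   (the network sizes, the single training case and η are assumptions):

       s   = @(a) 1 ./ (1 + exp(-a));              % sigmoid
       x   = rand(3, 1);  t = [1; 0];              % one training case (input, target)
       v   = randn(4, 3); w = randn(2, 4);         % weights v_ji and w_kj
       eta = 0.5;                                  % learning rate

       y       = s(v * x);                         % forward pass
       o       = s(w * y);
       delta_o = o .* (1 - o) .* (t - o);          % output error terms  delta_k^o
       delta_y = y .* (1 - y) .* (w' * delta_o);   % hidden error terms  delta_j^y
       w = w + eta * delta_o * y';                 % delta w_kj = eta * delta_k^o * y_j
       v = v + eta * delta_y * x';                 % delta v_ji = eta * delta_j^y * x_i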
Backpropagation
Batch learning vs. Incremental learning




Batch standard backprop proceeds as follows:
   Initialize the weights W.
   Repeat the following steps:
      Process all the training data DL to compute the gradient of the average
       error function AQ(DL,W).
      Update the weights by subtracting the gradient times the learning rate.

Incremental standard backprop can be done as follows:
   Initialize the weights W.
   Repeat the following steps for j = 1 to NL:
      Process one training case (y_j,X_j) to compute the gradient of the error
       (loss) function Q(y_j,X_j,W).
      Update the weights by subtracting the gradient times the learning rate.
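
A minimal sketch contrasting the two schedules on a linear model with squared error (the toy data,
learning rate and epoch count are assumptions; the point is only when the weights get updated):

    X = rand(100, 3);  Y = X * [1; -2; 0.5];       % toy training data DL
    eta = 0.01;

    % Batch: one update per pass, using the gradient of the average error AQ(DL,W)
    W = zeros(3, 1);
    for epoch = 1:50
        grad = X' * (X * W - Y) / size(X, 1);
        W    = W - eta * grad;
    end

    % Incremental: one update per training case (y_j, X_j)
    W = zeros(3, 1);
    for epoch = 1:50
        for j = 1:size(X, 1)
            grad = X(j, :)' * (X(j, :) * W - Y(j));
            W    = W - eta * grad;
        end
    end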
Training
Overfitting
#2. Decision Trees
Introduction
                  Divide & conquer

                  Hierarchical model

                  Sequence of
                   recursive splits

                  Decision node vs.
                   leaf node

                  Advantage
                      Interpretability
                          IF-THEN rules
Divide and Conquer
   Internal decision nodes
       Univariate: Uses a single attribute, xi
           Numeric xi : Binary split : xi > wm
           Discrete xi : n-way split for n possible values
       Multivariate: Uses all attributes, x

   Leaves
       Classification: Class labels, or proportions
        Regression: Numeric; the average of the leaf’s r values, or a local fit

   Learning
       Construction of the tree using training examples
       Looking for the simplest tree among the trees that code the training
        data without error
           Based on heuristics
           NP-complete
           “Greedy”; find the best split recursively (Breiman et al, 1984; Quinlan, 1986, 1993)
Classification Trees

   Split is main procedure for tree
    construction
       By impurity measure

   For node m, Nm instances reach m, and Nm^i of them belong to class Ci

               P̂( Ci | x, m ) ≡ pm^i = Nm^i / Nm                 To be pure!!!

   Node m is pure if pm^i is 0 or 1

   Measure of impurity is entropy:      Im = − Σ i=1..K  pm^i log2 pm^i
Representation




   Each node specifies a test of some attribute of the instance

   Each branch corresponds to one of the possible values of this attribute
Best Split
   If node m is pure, generate a leaf and stop, otherwise split
    and continue recursively

   Impurity after split: Nmj of the Nm instances take branch j, and Nmj^i of them belong to Ci

               P̂( Ci | x, m, j ) ≡ pmj^i = Nmj^i / Nmj

               I′m = − Σ j=1..n ( Nmj / Nm )  Σ i=1..K  pmj^i log2 pmj^i




   Find the variable and split that minimize impurity (among all variables --
    and among split positions for numeric variables)
Q) “Which attribute should be tested at the root of the tree?”
Top-Down Induction of Decision Trees
Entropy
   “Measure of uncertainty”
   “Expected number of bits to resolve uncertainty”

   Suppose Pr{X = 0} = 1/8
     If other events are equally likely, the number of events is 8. To indicate
      one out of so many events, one needs lg 8 bits.
   Consider a binary random variable X s.t. Pr{X = 0} = 0.1.

                                                         1  0.1 lg
                                                     1                      1
       The expected number of bits:      0.1 lg
                                                    0.1                 1  0.1
   In general, if a random variable X has c values with prob. p_c:
                                                c          c
                                                     1
       The expected number of bits:      H   pi lg   pi lg pi
                                              i 1   pi  i 1
Entropy
Example

   14 examples
                  Entropy([9,5])
                   (9 /14) log 2 (9 /14)  (5 /14) log 2 (5 /14)  0.940

         Entropy 0 : all members positive or negative
         Entropy 1 : equal number of positive & negative
         0 < Entropy < 1 : unequal number of positive & negative
Information Gain

   Measures the expected reduction in entropy caused by partitioning
    the examples
Information Gain
   ICU-Student tree, candidate split on Gender:

      Root node:            • # of samples = 100   • # of positive samples = 50   • Entropy = 1
      Left side (Male):     • # of samples = 50    • # of positive samples = 40   • Entropy = 0.72
      Right side (Female):  • # of samples = 50    • # of positive samples = 10   • Entropy = 0.72

      On average • Entropy = 0.5 * 0.72 + 0.5 * 0.72 = 0.72
      • Reduction in entropy = 1 − 0.72 = 0.28   →  Information gain

      (Next candidate attributes below the split: IQ on the Male branch, Height on the Female branch)
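
   A minimal MATLAB sketch of the information-gain computation for the Gender split above:

       H      = @(p) -sum(p(p > 0) .* log2(p(p > 0)));    % entropy in bits
       Hroot  = H([50 50] / 100);                         % 100 samples, 50 positive  -> 1
       Hleft  = H([40 10] / 50);                          % Male side,   40 positive  -> ~0.72
       Hright = H([10 40] / 50);                          % Female side, 10 positive  -> ~0.72
       gain   = Hroot - (0.5 * Hleft + 0.5 * Hright)      % reduction in entropy ~0.28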
Training Examples
Selecting the Next Attribute
Partially learned tree
Hypothesis Space Search
   Hypothesis space: the set of
    all possible decision trees

   DT is guided by information
    gain measure.




     Occam’s razor ??
Overfitting




•   Why “over”-fitting?
    – A model can become more complex than the true target function (concept)
      when it tries to fit the noisy data as well
Avoiding over-fitting the data
   Two classes of approaches to avoid overfitting
       Stop growing the tree earlier.
       Post-prune the tree after overfitting

   Ok, but how to determine the optimal size of a tree?
       Use validation examples to evaluate the effect of pruning (stopping)
       Use a statistical test to estimate the effect of pruning (stopping)
       Use a measure of complexity for encoding decision tree.


   Approaches based on the second strategy (post-pruning)
       Reduced error pruning
       Rule post-pruning
Rule Extraction from Trees

C4.5Rules
(Quinlan, 1993)
#3. Bayesian Networks
Bayes’ Rule
Introduction


        posterior = prior × likelihood / evidence

                    P(C | x) = P(C) p(x | C) / p(x)

  P(C = 0) + P(C = 1) = 1
  p(x) = p(x | C = 1) P(C = 1) + p(x | C = 0) P(C = 0)
  P(C = 0 | x) + P(C = 1 | x) = 1
Bayes’ Rule: K>2 Classes
Introduction


                         p x | Ci P Ci 
           P Ci | x  
                               p x 
                           p x | Ci P Ci 
                        K
                          p x | Ck P Ck 
                         k 1


                   K
  P Ci   0 and  P Ci   1
                  i 1

 choose Ci if P Ci | x   max k P Ck | x 
Bayesian Networks
Introduction

   Graphical models, probabilistic networks
       causality and influence

   Nodes are hypotheses (random vars) and the prob corresponds to our
    belief in the truth of the hypothesis

   Arcs are direct influences between hypotheses

   The structure is represented as a directed acyclic graph (DAG)
       Representation of the dependencies among random variables

   The parameters are the conditional probs in the arcs


         A small set of probabilities, relating only neighboring nodes,
         lets the B.N. represent all possible combinations of circumstances.
Bayesian Networks
Introduction




   Learning
       Inducing a graph
           From prior knowledge
           From structure learning
       Estimating parameters
           EM
   Inference
       Beliefs from evidences
           Especially among the nodes not directly connected
Structure
Introduction

   Initial configuration of BN
       Root nodes
         Prior probabilities
       Non-root nodes
         Conditional probabilities given all possible combinations of direct
          predecessors


             [Example DAG:  A → C,  A → D,  B → D,  D → E]

             Root nodes A, B:   P(a), P(b)
             Node C:            P(c|a), P(c|¬a)
             Node D:            P(d|a,b), P(d|a,¬b), P(d|¬a,b), P(d|¬a,¬b)
             Node E:            P(e|d), P(e|¬d)
Causes and Bayes’ Rule
  Introduction




                           [Figure: Rain → Wet grass, with a causal arrow (down)
                            and a diagnostic arrow (up)]

                           Diagnostic inference:
                           Knowing that the grass is wet,
                           what is the probability that rain is the cause?

                           P(R | W) = P(W | R) P(R) / P(W)
                                    = P(W | R) P(R) / [ P(W | R) P(R) + P(W | ¬R) P(¬R) ]
                                    = (0.9 × 0.4) / (0.9 × 0.4 + 0.2 × 0.6) = 0.75
Causal vs Diagnostic Inference
Introduction


                                   Causal inference: If the
                                   sprinkler is on, what is the
                                   probability that the grass is wet?

                                   P(W|S) = P(W|R,S) P(R|S) +
                                           P(W|~R,S) P(~R|S)
                                    = P(W|R,S) P(R) +
                                           P(W|~R,S) P(~R)
                                    = 0.95*0.4 + 0.9*0.6 = 0.92


 Diagnostic inference: If the grass is wet, what is the probability
 that the sprinkler is on?  P(S|W) = 0.35 > 0.2 = P(S)
 P(S|R,W) = 0.21
 Explaining away: Knowing that it has rained
         decreases the probability that the sprinkler is on.
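
 A minimal MATLAB sketch of the causal query above, using only the numbers given on the slide
 (P(W|R,S) = 0.95, P(W|¬R,S) = 0.9, P(R) = 0.4); the diagnostic query P(S|W) would additionally
 need P(W|R,¬S) and P(W|¬R,¬S), which are not given here:

       P_R    = 0.4;      % P(R)
       P_WRS  = 0.95;     % P(W |  R, S)
       P_WnRS = 0.90;     % P(W | ~R, S)
       P_W_given_S = P_WRS * P_R + P_WnRS * (1 - P_R)    % = 0.92, as on the slide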
Bayesian Networks: Causes
Introduction


                    Causal inference:
                    P(W|C) = P(W|R,S) P(R,S|C) +
                           P(W|~R,S) P(~R,S|C) +
                           P(W|R,~S) P(R,~S|C) +
                           P(W|~R,~S) P(~R,~S|C)

                    and use the fact that
                     P(R,S|C) = P(R|C) P(S|C)

                           Diagnostic: P(C|W ) = ?
Bayesian Nets: Local structure
Introduction




                                              P (F | C) = ?




       P( X1, …, Xd ) = Π i=1..d  P( Xi | parents(Xi) )
Bayesian Networks: Inference
Introduction


   P (C,S,R,W,F ) = P (C ) P (S |C ) P (R |C ) P (W |R,S ) P (F |R )

   P (C,F ) = ∑S ∑R ∑W P (C,S,R,W,F )

   P (F |C) = P (C,F ) / P(C )   Not efficient!


   Belief propagation (Pearl, 1988)
   Junction trees (Lauritzen and Spiegelhalter, 1988)
       Independence assumption
Inference
Evidence & Belief Propagation
   Evidence – values of observed nodes
       e.g.  V3 = T, V6 = 3
   Our belief in what the value of each Vi ‘should’ be changes.
   This belief is propagated

    As if the CPTs became:
         V3:   P(V3 = T) = 1.0,  P(V3 = F) = 0.0
         V6:   P(V6 = 1 | V2) = 0.0,  P(V6 = 2 | V2) = 0.0,  P(V6 = 3 | V2) = 1.0
               (for both V2 = T and V2 = F)

   [Figure: tree-structured network over nodes V1–V6, with V3 and V6 observed]
Belief Propagation
                                                      Bayes’ law:   P(A | B) = P(B | A) P(A) / P(B)

             “Causal” message (π):       going down an arrow, sum out the parent
             “Diagnostic” message (λ):   going up an arrow, apply Bayes’ law
                                         (messages are normalized by a constant 1/α)

   [Figure: π and λ messages passed along the arcs of the network;
    some figures from Peter Lucas’ BN lecture course]
The π Messages

• What are the π messages?
• For simplicity, let the nodes be binary

        V1:   P(V1 = T) = 0.8,  P(V1 = F) = 0.2

        The message passes on information. What information? Observe:

            P(V2) = P(V2 | V1 = T) P(V1 = T) + P(V2 | V1 = F) P(V1 = F)

        CPT of V2 given V1:   P(V2 = T | V1 = T) = 0.4,  P(V2 = T | V1 = F) = 0.9
                              P(V2 = F | V1 = T) = 0.6,  P(V2 = F | V1 = F) = 0.1

        The information needed from above is the distribution of V1, i.e. π(V1)

         π messages capture information passed from parent to child
The λ Messages

• We know what the π messages are
• What about λ?
                 Assume E = { V2 } and compute by Bayes’ rule:
       V1
                     P(V1 | V2) = P(V1) P(V2 | V1) / P(V2) = α P(V1) P(V2 | V1)

                 The information not available at V1 is P(V2 | V1). It is passed
       V2        upwards by a λ-message. Again, this is not in general exactly the
                 CPT, but the belief based on evidence down the tree.
Belief Propagation

   [Figure: node V with parents U1, U2 and children V1, V2.
    π messages π(U1), π(U2) flow down from the parents into V, and π(V1), π(V2) from V into its children;
    λ messages λ(V1), λ(V2) flow up from the children into V, and λ(U1), λ(U2) from V up to its parents.]
Evidence & Belief

   [Figure: evidence entered at nodes V1 and V5/V6 propagates through the
    network and updates the belief at the intermediate nodes, e.g. V3]

                     Works for classification ??
Naive Bayes’ Classifier




    Given C, xj are independent:

          p(x|C) = p(x1|C) p(x2|C) ... p(xd|C)
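
 A minimal MATLAB sketch of a naive Bayes’ classifier over binary features using this factorization
 (the priors, per-feature likelihoods and the observed x are made-up numbers; the element-wise
 broadcasting assumes a recent MATLAB or Octave):

       prior = [0.6 0.4];                % P(C=1), P(C=2)
       pxc   = [0.8 0.3;                 % P(x_j = 1 | C): rows = features j, columns = classes
                0.2 0.7;
                0.5 0.9];
       x     = [1; 0; 1];                % observed binary feature vector
       lik   = prod(pxc .* x + (1 - pxc) .* (1 - x));   % p(x|C) = prod_j p(x_j|C)
       post  = lik .* prior / sum(lik .* prior);        % P(C|x) by Bayes' rule
       [~, chosen_class] = max(post)                    % pick the most probable class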
Application Procedures
For classification
   MLP
       Data collection & Pre-processing (Training data / Test data)
       Decision node selection (output node)
       Network training
       Generalization
       Parameter tuning & Pruning
       Final network
   Decision Trees
       Data collection & Pre-processing (Training data / Test data)
       Decision attribute selection
       Tree construction
       Pruning
       Final tree
   Bayesian Networks
       Data collection & Pre-processing (Training data / Test data)
       Structure configuration
             Prior knowledge
       Parameter learning
       Decision node selection
       Inference (classification)
             Evidence & belief
       Final network
Simulation
   Simulation Packages
       WEKA (JAVA)
           http://www.cs.waikato.ac.nz/ml/weka/
       FullBNT (MATLAB)
           http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html
       MSBNx
           http://research.microsoft.com/msbn/
       MATLAB Neural Networks Toolbox
           http://www.mathworks.com/products/neuralnet/
       C4.5
           http://www.rulequest.com/Personal/
WEKA
FullBNT
    clear all

    N = 4;                          % number of nodes
    dag = zeros(N,N);               % shell for the network structure
    C = 1; S = 2; R = 3; W = 4;     % name each node
    dag(C,[R S]) = 1;               % specify the network structure
    dag(R,W) = 1;
    dag(S,W) = 1;

    %discrete_nodes = 1:N;
    node_sizes = 2*ones(1,N);       % number of values each node can take
    %node_sizes = [4 2 3 5];
    %onodes = [];
    %bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes, 'observed', onodes);

    bnet = mk_bnet(dag, node_sizes, 'names', {'C','S','R','W'}, 'discrete', 1:4);
    %C = bnet.names('cloudy'); % bnet.names is an associative array
    %bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);

    %%%%%% Specified Parameters
    %bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
    %bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);
    %bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);
    %bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
MSBNx
References
   Textbooks
       Ethem ALPAYDIN, Introduction to Machine Learning, The MIT Press, 2004
       Tom Mitchell, Machine Learning, McGraw Hill, 1997
       Neapolitan, R.E., Learning Bayesian Networks, Prentice Hall, 2003

   Materials
        Serafín Moral, Learning Bayesian Networks, University of Granada, Spain
       Zheng Rong Yang, Connectionism, Exeter University
        KyuTae Cho, Jeong Ki Yoo, HeeJin Lee, Uncertainty in AI, Probabilistic Reasoning,
         Especially for Bayesian Networks
       Gary Bradski, Sebastian Thrun, Bayesian Networks in Computer Vision, Stanford
        University

   Recommended Textbooks
       Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
       J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1992
       Haykin, Simon S., Neural networks : a comprehensive foundation, Prentice Hall, 1999
       Jensen, Finn V., Bayesian networks and decision graphs, Springer, 2007
