Introduction to Statistical Machine Learning

Christfried Webers

Statistical Machine Learning Group
NICTA
and
College of Engineering and Computer Science
The Australian National University

Canberra
February – June 2011

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")



Part IV
Probability and Uncertainty

    Boxes with Apples and Oranges
    Bayes’ Theorem
    Bayes’ Probabilities
    Probability Distributions
    Gaussian Distribution over a Vector
    Decision Theory
    Model Selection - Key Ideas




Simple Experiment
1  Choose a box.
       Red box: p(B = r) = 4/10
       Blue box: p(B = b) = 6/10
2  Choose any item of the selected box with equal probability.

Given that we have chosen an orange, what is the probability that the box we chose was the blue one?




What do we know?

p(F = o | B = b) = 1/4
p(F = a | B = b) = 3/4
p(F = o | B = r) = 3/4
p(F = a | B = r) = 1/4

Given that we have chosen an orange, what is the probability that the box we chose was the blue one?

p(B = b | F = o) = ?




Calculating the Posterior p(B = b | F = o)
Bayes’ Theorem

$$ p(B = b \mid F = o) = \frac{p(F = o \mid B = b)\, p(B = b)}{p(F = o)} $$

Sum rule for the denominator:

$$ p(F = o) = p(F = o, B = b) + p(F = o, B = r) = p(F = o \mid B = b)\, p(B = b) + p(F = o \mid B = r)\, p(B = r) = \frac{1}{4} \times \frac{6}{10} + \frac{3}{4} \times \frac{4}{10} = \frac{9}{20} $$

$$ p(B = b \mid F = o) = \frac{1}{4} \times \frac{6}{10} \times \frac{20}{9} = \frac{1}{3} $$
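The same calculation as a small sketch in plain Python (the dictionary encoding is an illustrative choice, not from the slides):

```python
# Prior over boxes and likelihood of fruit given box, as on the slide.
prior = {"r": 4 / 10, "b": 6 / 10}                   # p(B)
likelihood = {("o", "b"): 1 / 4, ("a", "b"): 3 / 4,  # p(F | B)
              ("o", "r"): 3 / 4, ("a", "r"): 1 / 4}

# Sum rule: p(F = o) = sum over boxes of p(F = o | B) p(B)
p_orange = sum(likelihood[("o", box)] * prior[box] for box in prior)

# Bayes' theorem: p(B = b | F = o)
p_blue_given_orange = likelihood[("o", "b")] * prior["b"] / p_orange

print(p_orange)             # 0.45   (= 9/20)
print(p_blue_given_orange)  # 0.333... (= 1/3)
```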


Bayes’ Theorem - revisited




Before choosing an item from a box, the most complete information we have is the prior p(B).
Note that in our example p(B = b) = 6/10; choosing a box according to the prior alone, we would opt for the blue box.
Once we observe some data (e.g. we choose an orange), we can calculate the posterior probability p(B = b | F = o) via Bayes’ theorem.
After observing an orange, the posterior probability is p(B = b | F = o) = 1/3, and therefore p(B = r | F = o) = 2/3. Having observed an orange, it is now more likely that it came from the red box.



Bayes’ Rule




$$ \underbrace{p(Y \mid X)}_{\text{posterior}} = \frac{\overbrace{p(X \mid Y)}^{\text{likelihood}}\ \overbrace{p(Y)}^{\text{prior}}}{\underbrace{p(X)}_{\text{normalisation}}} = \frac{p(X \mid Y)\, p(Y)}{\sum_Y p(X \mid Y)\, p(Y)} $$




Bayes’ Probabilities




classical or frequentist interpretation of probabilities
Bayesian view: probabilities represent uncertainty
Example: Will the Arctic ice cap have disappeared by the end of the century?
fresh evidence can change the opinion on ice loss
goal: quantify uncertainty and revise it in the light of new evidence
use the Bayesian interpretation of probability




Andrey Kolmogorov - Axiomatic Probability Theory (1933)
Let (Ω, F, P) be a measure space with P(Ω) = 1. Then (Ω, F, P) is a probability space, with sample space Ω, event space F and probability measure P.

Axiom 1: $P(E) \ge 0$ for all $E \in F$.
Axiom 2: $P(\Omega) = 1$.
Axiom 3: $P(E_1 \cup E_2 \cup \dots) = \sum_i P(E_i)$ for any countable sequence of pairwise disjoint events $E_1, E_2, \dots$




Richard Threlkeld Cox (1946, 1961)



Assume numerical values are used to represent degrees of belief.
Define a set of axioms encoding common-sense properties of such beliefs.
This results in a set of rules for manipulating degrees of belief which are equivalent to the sum and product rules of probability.
Many other authors have proposed different sets of axioms and properties.
Result: the numerical quantities all behave according to the rules of probability.
Denote these quantities Bayesian probabilities.



Curve fitting - revisited




uncertainty about the parameter w is captured in the prior probability p(w)
observed data D = {t_1, …, t_N}
calculate the uncertainty in w after the data D have been observed:

$$ p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)} $$

p(D | w) as a function of w: the likelihood function
the likelihood expresses how probable the observed data are for different values of w
it is not a probability distribution over w
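As a sketch of the idea, the likelihood of a toy data set under a one-parameter model can be evaluated on a grid of w values; the model, data, and noise level below are illustrative assumptions, not from the slides:

```python
import math

xs = [0.0, 1.0, 2.0, 3.0]   # toy inputs
ts = [0.1, 0.9, 2.1, 2.9]   # toy observed targets
sigma2 = 0.04               # assumed known noise variance

def likelihood(w):
    # p(D | w) = product of N(t_n | w * x_n, sigma2), viewed as a function of w
    p = 1.0
    for x, t in zip(xs, ts):
        p *= math.exp(-(t - w * x) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
    return p

# The likelihood is a function of w, not a distribution over w:
for w in [0.5, 0.8, 1.0, 1.2]:
    print(w, likelihood(w))
```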




Likelihood Function - Frequentist versus Bayesian




Likelihood function p(D | w)

Frequentist approach:
    w is considered a fixed parameter
    its value is defined by some ’estimator’
    error bars on the estimated w are obtained from the distribution of possible data sets D

Bayesian approach:
    there is only one single data set D
    uncertainty in the parameters comes from a probability distribution over w



Frequentist Estimator - Maximum Likelihood




choose w for which the likelihood p(D | w) is maximal, i.e. for which the probability of the observed data is maximal
Machine Learning: the error function is the negative log of the likelihood function
log is a monotonic function
maximising the likelihood ⟺ minimising the error
Example: a fair-looking coin is tossed three times, always landing on heads. The maximum likelihood estimate of the probability of landing heads is 1.
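A one-line check of the coin example (the MLE of a Bernoulli parameter is the sample mean; a plain-Python sketch):

```python
# Three tosses, all heads: 1 = heads, 0 = tails.
tosses = [1, 1, 1]

# The maximum likelihood estimate of mu = p(heads) is the sample mean.
mu_ml = sum(tosses) / len(tosses)
print(mu_ml)   # 1.0 -- the MLE assigns zero probability to ever seeing tails
```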




Bayesian Approach




including prior knowledge is easy (via the prior over w)
BUT: a badly chosen prior can lead to bad results
the choice of prior is subjective
sometimes the choice of prior is motivated by a convenient mathematical form
need to sum/integrate over the whole parameter space
advances in sampling (Markov chain Monte Carlo methods)
advances in approximation schemes (variational Bayes, expectation propagation)




The Gaussian Distribution


x ∈ ℝ
Gaussian distribution with mean µ and variance σ²:

$$ \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\} $$

[Figure: the univariate Gaussian density N(x | µ, σ²), centred at the mean µ with width indicated by 2σ.]
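A direct transcription of the density into Python (a sketch; the evaluation point and parameters are arbitrary):

```python
import math

def gauss(x, mu, sigma2):
    # N(x | mu, sigma^2) as defined above
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

print(gauss(0.0, 0.0, 1.0))   # ~0.3989, the standard normal density at its mean
```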




The Gaussian Distribution

$\mathcal{N}(x \mid \mu, \sigma^2) > 0$

$\int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\, dx = 1$

Expectation of x:

$$ \mathbb{E}[x] = \int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\, x\, dx = \mu $$

Expectation of x²:

$$ \mathbb{E}[x^2] = \int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\, x^2\, dx = \mu^2 + \sigma^2 $$

Variance of x:

$$ \operatorname{var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2 = \sigma^2 $$
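These identities can be checked numerically with a crude Riemann sum (a sketch; µ = 1 and σ² = 4 are arbitrary choices):

```python
import math

def gauss(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

mu, sigma2 = 1.0, 4.0
dx = 0.001
xs = [-20.0 + i * dx for i in range(40000)]   # grid covering [-20, 20)

total  = sum(gauss(x, mu, sigma2) * dx for x in xs)          # ~1   (normalisation)
mean   = sum(x * gauss(x, mu, sigma2) * dx for x in xs)      # ~mu  (E[x])
second = sum(x * x * gauss(x, mu, sigma2) * dx for x in xs)  # ~mu^2 + sigma^2 (E[x^2])

print(total, mean, second - mean ** 2)   # ~1.0, ~1.0, ~4.0
```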


The Mode of a Probability Distribution
Mode of a distribution: the value that occurs most frequently in a probability distribution.
For a probability density function: the value x at which the probability density attains its maximum.
The Gaussian distribution has one mode (it is unimodal); the mode is µ.
If there are multiple local maxima in the probability distribution, the distribution is called multimodal (example: a mixture of three Gaussians).

[Figure: a unimodal Gaussian density N(x | µ, σ²) with mode µ (left), and a multimodal density p(x) with several local maxima (right).]

The Bernoulli Distribution




Two possible outcomes x ∈ {0, 1} (e.g. a coin which may be damaged).
p(x = 1 | µ) = µ for 0 ≤ µ ≤ 1
p(x = 0 | µ) = 1 − µ

Bernoulli distribution:

$$ \operatorname{Bern}(x \mid \mu) = \mu^x (1 - \mu)^{1 - x} $$

Expectation of x: $\mathbb{E}[x] = \mu$
Variance of x: $\operatorname{var}[x] = \mu(1 - \mu)$
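The distribution and its moments, checked by enumerating the two outcomes (a plain-Python sketch; µ = 0.3 is arbitrary):

```python
def bern(x, mu):
    # Bern(x | mu) = mu^x * (1 - mu)^(1 - x), for x in {0, 1}
    return mu ** x * (1 - mu) ** (1 - x)

mu = 0.3
mean = sum(x * bern(x, mu) for x in (0, 1))
var  = sum((x - mean) ** 2 * bern(x, mu) for x in (0, 1))
print(mean, var)   # 0.3, 0.21 -- i.e. mu and mu * (1 - mu)
```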




The Binomial Distribution

Flip a coin N times. What is the distribution of observing heads exactly m times?
This is a distribution over m ∈ {0, …, N}.
Binomial distribution:

$$ \operatorname{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^m (1 - \mu)^{N - m} $$

[Figure: histogram of Bin(m | N = 10, µ = 0.25) over m = 0, …, 10.]
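The values behind the figure can be reproduced directly (a sketch using math.comb from the standard library):

```python
from math import comb

def binom(m, N, mu):
    # Bin(m | N, mu) = C(N, m) * mu^m * (1 - mu)^(N - m)
    return comb(N, m) * mu ** m * (1 - mu) ** (N - m)

# The distribution plotted above: N = 10, mu = 0.25.
for m in range(11):
    print(m, round(binom(m, 10, 0.25), 4))
```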

The Beta Distribution

Beta distribution:

$$ \operatorname{Beta}(\mu \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, \mu^{a-1} (1 - \mu)^{b-1} $$
[Figure: Beta(µ | a, b) for (a, b) = (0.1, 0.1), (1, 1), (2, 3), and (8, 4).]
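The density can be evaluated with the standard-library gamma function (a sketch reproducing the four parameter settings of the figure at µ = 0.5):

```python
from math import gamma

def beta_pdf(mu, a, b):
    # Beta(mu | a, b) = Gamma(a + b) / (Gamma(a) Gamma(b)) * mu^(a-1) * (1 - mu)^(b-1)
    return gamma(a + b) / (gamma(a) * gamma(b)) * mu ** (a - 1) * (1 - mu) ** (b - 1)

for a, b in [(0.1, 0.1), (1, 1), (2, 3), (8, 4)]:
    print((a, b), round(beta_pdf(0.5, a, b), 4))
```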



The Gaussian Distribution over a Vector x
x ∈ ℝ^D
Gaussian distribution with mean µ ∈ ℝ^D and covariance matrix Σ ∈ ℝ^{D×D}:

$$ \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\} $$

where |Σ| is the determinant of Σ.
                                                                           Probability Distributions
[Figure: elliptical contour of a 2-D Gaussian centred at µ; the axes y_1, y_2 point along the eigenvectors u_1, u_2 of Σ, with lengths proportional to λ_1^{1/2}, λ_2^{1/2}.]
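A minimal sketch evaluating this density with numpy (µ, Σ, and the query point are illustrative):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    # N(x | mu, Sigma) for x, mu in R^D and a D x D covariance matrix Sigma
    D = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm)

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mvn_pdf(np.array([0.5, -0.5]), mu, Sigma))
```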


The Gaussian Distribution over a Vector x
We can find a linear transformation to a new coordinate system y in which x becomes

$$ y = U^T (x - \mu), $$

where U is the eigenvector matrix of the covariance matrix Σ, with eigenvalue matrix E:

$$ \Sigma U = U E, \qquad E = \begin{pmatrix} \lambda_1 & 0 & \dots & 0 \\ 0 & \lambda_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \lambda_D \end{pmatrix} $$

U can be made an orthogonal matrix; therefore the columns u_i of U are unit vectors which are orthogonal to each other:

$$ u_i^T u_j = \begin{cases} 1 & i = j \\ 0 & i \ne j \end{cases} $$

Now we can write Σ and its inverse (prove that Σ Σ^{-1} = I):

$$ \Sigma = U E U^T = \sum_{i=1}^{D} \lambda_i u_i u_i^T, \qquad \Sigma^{-1} = U E^{-1} U^T = \sum_{i=1}^{D} \frac{1}{\lambda_i} u_i u_i^T $$
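These identities are easy to verify numerically (a sketch with numpy; the covariance matrix is an arbitrary example):

```python
import numpy as np

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

lam, U = np.linalg.eigh(Sigma)   # eigenvalues lambda_i, orthogonal eigenvector matrix U
E = np.diag(lam)

print(np.allclose(Sigma, U @ E @ U.T))                                # Sigma = U E U^T
print(np.allclose(np.linalg.inv(Sigma), U @ np.diag(1 / lam) @ U.T))  # Sigma^-1 = U E^-1 U^T
print(np.allclose(U.T @ U, np.eye(2)))                                # columns of U orthonormal
```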
The Gaussian Distribution over a Vector x
Now use the linear transformation y = U^T(x − µ) and Σ^{-1} = U E^{-1} U^T to transform the exponent (x − µ)^T Σ^{-1} (x − µ) into

$$ y^T E^{-1} y = \sum_{j=1}^{D} \frac{y_j^2}{\lambda_j} $$

Exponentiating the sum (and taking care of the normalisation factors) results in a product of scalar-valued Gaussian distributions in the orthogonal directions u_i:

$$ p(y) = \prod_{j=1}^{D} \frac{1}{(2\pi\lambda_j)^{1/2}} \exp\left\{ -\frac{y_j^2}{2\lambda_j} \right\} $$

Partitioned Gaussians
Given a joint Gaussian distribution N(x | µ, Σ), where µ is the mean vector, Σ the covariance matrix, and Λ ≡ Σ^{-1} the precision matrix.
Assume that the variables can be partitioned into two sets:

$$ x = \begin{pmatrix} x_a \\ x_b \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}, \quad \Lambda = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix} $$

$$ \Lambda_{aa} = (\Sigma_{aa} - \Sigma_{ab} \Sigma_{bb}^{-1} \Sigma_{ba})^{-1} $$
$$ \Lambda_{ab} = -(\Sigma_{aa} - \Sigma_{ab} \Sigma_{bb}^{-1} \Sigma_{ba})^{-1} \Sigma_{ab} \Sigma_{bb}^{-1} $$

Conditional distribution:

$$ p(x_a \mid x_b) = \mathcal{N}(x_a \mid \mu_{a|b}, \Lambda_{aa}^{-1}), \qquad \mu_{a|b} = \mu_a - \Lambda_{aa}^{-1} \Lambda_{ab} (x_b - \mu_b) $$

Marginal distribution:

$$ p(x_a) = \mathcal{N}(x_a \mid \mu_a, \Sigma_{aa}) $$
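A sketch of the conditional-distribution formulas for a 2-D Gaussian, conditioning on x_b = 0.7 as in the figure on the next slide (the numerical values of µ and Σ are illustrative):

```python
import numpy as np

mu = np.array([0.5, 0.5])
Sigma = np.array([[0.04, 0.03],
                  [0.03, 0.04]])

Lam = np.linalg.inv(Sigma)             # precision matrix Lambda = Sigma^-1
Lam_aa, Lam_ab = Lam[0, 0], Lam[0, 1]

x_b = 0.7                              # observed value of x_b
mu_cond  = mu[0] - (1 / Lam_aa) * Lam_ab * (x_b - mu[1])   # mu_{a|b}
var_cond = 1 / Lam_aa                                      # Lambda_aa^-1

print(mu_cond, var_cond)   # mean and variance of p(x_a | x_b = 0.7)
```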
Partitioned Gaussians


[Figure: Contours of a Gaussian distribution over two variables x_a and x_b (left), and the marginal distribution p(x_a) and the conditional distribution p(x_a | x_b) for x_b = 0.7 (right).]



Nonlinear Change of Variables in Distributions
Given some p_x(x), consider a nonlinear change of variables

$$ x = g(y) $$

What is the new probability distribution p_y(y) in terms of the variable y?

$$ p_y(y) = p_x(x) \left| \frac{dx}{dy} \right| = p_x(g(y))\, |g'(y)| $$

For vector-valued x and y:

$$ p_y(y) = p_x(x)\, |J|, \qquad J_{ij} = \frac{\partial x_i}{\partial y_j} $$
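A numerical sanity check (a sketch; the choices p_x(x) = e^{-x} on x > 0 and g(y) = e^y are illustrative assumptions that make the transformed density easy to verify):

```python
import math

def p_x(x):
    return math.exp(-x)          # exponential density on x > 0

def p_y(y):
    # p_y(y) = p_x(g(y)) * |g'(y)| with g(y) = exp(y), so g'(y) = exp(y)
    return p_x(math.exp(y)) * math.exp(y)

# Riemann-sum check that the transformed density still integrates to 1:
dy = 0.001
print(sum(p_y(-10.0 + i * dy) * dy for i in range(20000)))   # ~1.0
```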


Decision Theory - Key Ideas



Two classes C_1 and C_2
joint distribution p(x, C_k)
using Bayes’ theorem:

$$ p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{p(x)} $$

Example: cancer treatment (K = 2 classes)
data x: an X-ray image
C_1: patient has cancer (C_2: patient has no cancer)
p(C_1) is the prior probability of a person having cancer
p(C_1 | x) is the posterior probability of a person having cancer after having seen the X-ray data



Decision Theory - Key Ideas
Need a rule which assigns each value of the input x to one of the available classes.
The input space is partitioned into decision regions R_k.
This leads to decision boundaries or decision surfaces.
Probability of a mistake:

$$ p(\text{mistake}) = p(x \in R_1, C_2) + p(x \in R_2, C_1) = \int_{R_1} p(x, C_2)\, dx + \int_{R_2} p(x, C_1)\, dx $$

[Figure: the input space partitioned into decision regions R_1 and R_2.]




Decision Theory - Key Ideas
Probability of a mistake:

$$ p(\text{mistake}) = p(x \in R_1, C_2) + p(x \in R_2, C_1) = \int_{R_1} p(x, C_2)\, dx + \int_{R_2} p(x, C_1)\, dx $$

Goal: minimise p(mistake).

[Figure: the joint densities p(x, C_1) and p(x, C_2) as functions of x, with decision regions R_1 and R_2 separated by a boundary at x_0.]
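A sketch of this quantity for two classes with Gaussian class-conditional densities (the priors, means, and shared variance are illustrative assumptions); moving the boundary x_0 changes the mistake probability, with the minimum near the point where the joint densities cross:

```python
import math

def joint(x, mu, prior, sigma2=1.0):
    # p(x, C_k) = p(x | C_k) p(C_k) with a Gaussian class-conditional density
    return prior * math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

p1 = lambda x: joint(x, mu=0.0, prior=0.6)   # p(x, C_1)
p2 = lambda x: joint(x, mu=2.0, prior=0.4)   # p(x, C_2)

def p_mistake(x0, dx=0.001):
    # R_1 = (-inf, x0], R_2 = (x0, inf): integrate the "wrong" joint over each region
    xs = [-10.0 + i * dx for i in range(20000)]
    return sum((p2(x) if x <= x0 else p1(x)) * dx for x in xs)

for x0 in [0.5, 1.0, 1.2, 1.5]:
    print(x0, round(p_mistake(x0), 4))   # minimum near x0 where p(x, C_1) = p(x, C_2)
```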

Decision Theory - Key Ideas




multiple classes
instead of minimising the probability of mistakes, maximise the probability of correct classification:

$$ p(\text{correct}) = \sum_{k=1}^{K} p(x \in R_k, C_k) = \sum_{k=1}^{K} \int_{R_k} p(x, C_k)\, dx $$




Minimising the Expected Loss




Not all mistakes are equally costly.
Weight each misclassification of x to the wrong class C_j, instead of the correct class C_k, by a factor L_{kj}.
The expected loss is now

$$ \mathbb{E}[L] = \sum_k \sum_j \int_{R_j} L_{kj}\, p(x, C_k)\, dx $$

Goal: minimise the expected loss E[L].
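Pointwise, minimising E[L] amounts to assigning x to the class j that minimises Σ_k L_{kj} p(C_k | x). A sketch with an illustrative loss matrix in the spirit of the cancer example (missing a cancer is penalised far more than a false alarm):

```python
# Illustrative loss matrix L[k][j]: cost of deciding class j when the truth is class k.
L = [[0, 100],   # true class C_1 (cancer): missing it costs 100
     [1, 0]]     # true class C_2 (no cancer): a false alarm costs 1

def decide(posterior):
    # posterior = [p(C_1 | x), p(C_2 | x)]; pick j minimising sum_k L[k][j] * posterior[k]
    risks = [sum(L[k][j] * posterior[k] for k in range(2)) for j in range(2)]
    return min(range(2), key=lambda j: risks[j])

print(decide([0.02, 0.98]))    # 0: even a 2% cancer posterior triggers the cancer decision
print(decide([0.005, 0.995]))  # 1: below ~1%, "no cancer" has the smaller expected loss
```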




The Reject Region



    Avoid making automated decisions on difficult cases.
    Difficult cases:
        posterior probabilities p(Ck | x) are very small
                                                               Boxes with Apples and
        joint distributions p(x, Ck ) have comparable values   Oranges

                                                               Bayes’ Theorem
    [Figure: the posterior probabilities p(C1 | x) and p(C2 | x) plotted
    against x on a scale from 0.0 to 1.0; the inputs x for which the larger
    posterior lies below the threshold θ form the reject region.]
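A minimal sketch of the corresponding reject rule, assuming (as the figure suggests) that we reject whenever the largest posterior falls below the threshold θ; names and values are illustrative:

    import numpy as np

    REJECT = -1   # sentinel meaning "leave this case to a human expert"

    def classify_with_reject(posterior, theta=0.9):
        """posterior: array of p(Ck | x); reject if no class is confident enough."""
        k = int(np.argmax(posterior))
        return k if posterior[k] >= theta else REJECT

    print(classify_with_reject(np.array([0.95, 0.05])))   # confident -> class 0
    print(classify_with_reject(np.array([0.55, 0.45])))   # ambiguous -> REJECT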




                                                                                143of 146
S-fold Cross Validation
    Given a set of N data items and targets.
    Goal: find the best model (the type of model, and parameters such as the
    order p of the polynomial or the regularisation constant λ). Avoid
    overfitting.
    Solution: train a machine learning algorithm with some of the data, and
    evaluate it with the rest.
    If we have plenty of data:
      1   Train a range of models, or a model with a range of parameters.
      2   Compare the performance on an independent data set (the validation
          set) and choose the model with the best predictive performance.
      3   Overfitting to the validation set can still occur. Therefore, use a
          third set, the test set, for the final evaluation. (Keep the test
          set in a safe and never give it to the developers ;–)
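A minimal, library-free sketch of this three-way split; the 60/20/20 proportions and all names are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(seed=0)
    N = 100
    X, t = rng.normal(size=(N, 3)), rng.normal(size=N)   # stand-in data/targets

    idx = rng.permutation(N)                      # shuffle before splitting
    n_train, n_val = int(0.6 * N), int(0.2 * N)
    train, val, test = np.split(idx, [n_train, n_train + n_val])

    # Fit each candidate model on X[train], t[train]; choose the model that
    # predicts best on X[val], t[val]; report its score on X[test] only once.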


                                                                                             144of 146
S-fold Cross Validation
    With few data there is a dilemma: either few training data or few test
    data.
    The solution is cross-validation.
    Use a portion (S − 1)/S of the available data for training, but use all
    the data to assess the performance.
    For very scarce data one may use S = N, which is also called the
    leave-one-out technique.
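For example (illustrative numbers): with N = 100 items and S = 4, each run trains on (S − 1)/S × N = 75 items and is assessed on the remaining 25; with S = N, each run trains on N − 1 = 99 items and is assessed on a single held-out item.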




                                                                                      145of 146
S-fold Cross Validation
    Partition the data into S groups.
    Use S − 1 groups to train a set of models that are then evaluated on the
    remaining group.
    Repeat for all S choices of the held-out group, and average the
    performance scores from the S runs (a code sketch follows after the
    figure).

                                                                  Bayes’ Probabilities

    [Figure: four runs (run 1 to run 4); in each run a different one of the
    S groups is held out for evaluation while the rest are used for training.]
                             Example for S = 4.


                                                                                   146of 146
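A minimal sketch of this S-fold procedure; fit and score are hypothetical stand-ins for any training and evaluation routines:

    import numpy as np

    def s_fold_cv(X, t, fit, score, S=4, seed=0):
        """Average a model's score over S runs, each holding out one group."""
        idx = np.random.default_rng(seed).permutation(len(X))
        folds = np.array_split(idx, S)            # partition into S groups
        scores = []
        for k in range(S):
            held_out = folds[k]
            train = np.concatenate([folds[j] for j in range(S) if j != k])
            model = fit(X[train], t[train])       # train on S - 1 groups
            scores.append(score(model, X[held_out], t[held_out]))
        return float(np.mean(scores))             # average over the S runs

    # Setting S = len(X) gives the leave-one-out technique.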
