                       Machine Learning

                       Nando de Freitas
                       January 16, 2007


Lecture 2 - Google’s PageRank: Why math helps

OBJECTIVE: Motivate linear algebra and probability
as important and necessary tools for understanding large
datasets. We also describe the algorithm at the core of the
Google search engine.


3 PAGERANK

  Consider the following mini-web of 3 pages (the data):

[Figure: a directed graph on the three pages x1, x2 and x3; the arrows
(links) are labelled with the weights 1, 0.1, 0.9, 0.4 and 0.6.]

                                                    The nodes are the webpages and the arrows are links. The
numbers are the normalised number of links. We can re-write
this directed graph as a transition matrix T.

T is a stochastic matrix: its rows add up to 1, so that

        Ti,j = P (xj |xi),        Σ_j Ti,j = 1

In information retrieval, we want to know the “relevance” of
each webpage. That is, we want to compute the probability
of each webpage: p(xi) for i = 1, 2, 3.
Let’s start with a random guess π = (0.5, 0.2, 0.3)^T and
“crawl the web” (multiply by T several times). After, say,
N = 100 iterations we get:

        π^T T^N = (0.2, 0.4, 0.4)

We soon notice that no matter what initial π we choose, we
always converge to p = (0.2, 0.4, 0.4)^T. So

        p^T T = p^T
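
To make the “crawl the web” step concrete, here is a minimal MATLAB
sketch of the power iteration. The matrix T below is a hypothetical
row-stochastic example, not necessarily the one in the figure.

% Power iteration: repeatedly multiply an initial guess by T.
T = [0   1   0  ;
     0   0.1 0.9;
     0.4 0   0.6];           % hypothetical example: rows sum to 1
p = [0.5 0.2 0.3];           % arbitrary initial guess (a row vector)
for n = 1:100                % "crawl the web"
    p = p * T;
end
disp(p)                      % approximate invariant distribution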
The distribution p is a measure of the relevance of each page.
Google uses this. But will this work always? When does it
fail?

The Perron-Frobenius Theorem tells us that for any
starting point, the chain will converge to the invariant dis-
tribution p, as long as T is a stochastic transition matrix
that obeys the following properties:

 1. Irreducibility: For any state of the Markov chain,
    there is a positive probability of visiting all other states.
    That is, the matrix T cannot be reduced to separate
    smaller matrices, which is also the same as stating that
    the transition graph is connected.

 2. Aperiodicity: The chain should not get trapped in
    cycles.

Google’s strategy is to add a matrix of uniform noise E to
T:
        L = T + εE

where ε is a small number. L is then normalised. This
ensures irreducibility.
How quickly does this algorithm converge? What determines
the rate of convergence? Again matrix algebra and spectral
theory provide the answers:
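
A sketch of this fix, with illustrative values for ε and E (and the
hypothetical T from the previous sketch):

T = [0 1 0; 0 0.1 0.9; 0.4 0 0.6];   % hypothetical matrix from above
epsilon = 0.1;                       % illustrative small number
E = ones(size(T)) / size(T,1);       % uniform "noise" matrix
L = T + epsilon * E;
L = L ./ sum(L, 2);                  % re-normalise so each row sums to 1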
Lecture 3 - The Singular Value Decomposition (SVD)

OBJECTIVE: The SVD is a matrix factorization that has
many applications: e.g., information retrieval, least-squares
problems, image processing.


3 EIGENVALUE DECOMPOSITION

  Let A ∈ Rm×m. If we put the eigenvalues of A into a
diagonal matrix Λ and gather the eigenvectors into a matrix
X, then the eigenvalue decomposition of A is given by

        A = XΛX^{−1}.

But what if A is not a square matrix? Then the SVD comes
to the rescue.
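
A minimal MATLAB sketch of the eigenvalue decomposition on a small,
arbitrary square matrix:

A = [2 1; 1 3];                    % an illustrative square matrix
[X, Lambda] = eig(A);              % eigenvectors in X, eigenvalues on the diagonal of Lambda
norm(A - X * Lambda / X)           % A = X*Lambda*inv(X), so this is ~0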
3 FORMAL DEFINITION OF THE SVD

  Given A ∈ Rm×n, the SVD of A is a factorization of the
form
        A = UΣV^T

where the u are the left singular vectors, the σ are the singular
values and the v are the right singular vectors.

Σ ∈ Rn×n is diagonal with positive entries (singular values
in the diagonal).
U ∈ Rm×n with orthonormal columns.
V ∈ Rn×n with orthonormal columns.
(⇒ V is orthogonal so V^{−1} = V^T )

  The equations relating the right singular vectors {vj } and
the left singular vectors {uj } are

        Avj = σj uj ,     j = 1, 2, . . . , n,

i.e.,

        A [v1 v2 . . . vn] = [u1 u2 . . . un] diag(σ1, σ2, . . . , σn),

or AV = UΣ.
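
A minimal sketch checking these relations numerically with MATLAB’s
economy-size SVD (the matrix is random and purely illustrative):

A = randn(5, 3);                   % a 5x3 matrix, m > n
[U, S, V] = svd(A, 'econ');        % U is 5x3, S and V are 3x3
norm(A * V - U * S)                % AV = U*Sigma, so this is ~0
norm(V' * V - eye(3))              % columns of V are orthonormal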
 1. There is no assumption that m ≥ n or that A has full
    rank.

 2. All diagonal elements of Σ are non-negative and in non-
    increasing order:

        σ1 ≥ σ2 ≥ . . . ≥ σp ≥ 0,

    where p = min (m, n).


Theorem 1 Every matrix A ∈ Rm×n has a singular value
decomposition A = UΣV^T.
Furthermore, the singular values {σj } are uniquely de-
termined.
If A is square and σi ≠ σj for all i ≠ j, the left singu-
lar vectors {uj } and the right singular vectors {vj } are
uniquely determined to within a factor of ±1.


3 EIGENVALUE DECOMPOSITION

Theorem 2 The nonzero singular values of A are the
(positive) square roots of the nonzero eigenvalues of A^T A
or AA^T (these matrices have the same nonzero eigenval-
ues).

    Proof:
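
One standard argument, in outline: using A = UΣV^T and U^T U = I,

        A^T A = VΣ^T U^T UΣV^T = V (Σ^T Σ) V^T,

so the eigenvalues of A^T A are the σj^2, with eigenvectors vj. Similarly
AA^T = U (ΣΣ^T) U^T has the same nonzero eigenvalues σj^2, with
eigenvectors uj. Taking positive square roots gives the singular values.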
3 LOW-RANK APPROXIMATIONS

Theorem 3 ‖A‖2 = σ1, where ‖A‖2 = max_{x≠0} ‖Ax‖/‖x‖ = max_{‖x‖=1} ‖Ax‖.

    Proof:


Another way to understand the SVD is to consider how a
matrix may be represented by a sum of rank-one matrices.

Theorem 4

        A = σ1 u1 v1^T + σ2 u2 v2^T + · · · + σr ur vr^T,

where r is the rank of A.

    Proof:
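
A brief sketch of why the expansion holds: Σ = σ1 e1 e1^T + · · · + σr er er^T
(with ej the j-th standard basis vector), so

        UΣV^T = σ1 (Ue1)(V e1)^T + · · · + σr (Uer)(V er)^T
              = σ1 u1 v1^T + · · · + σr ur vr^T.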
  What is so useful about this expansion is that the ν-th
partial sum captures as much of the “energy” of A as
possible by a matrix of at most rank ν. In this case, “en-
ergy” is defined by the 2-norm.

Theorem 5 For any ν with 0 ≤ ν ≤ r define

        Aν = σ1 u1 v1^T + · · · + σν uν vν^T.

If ν = p = min(m, n), define σν+1 = 0.
Then,
        ‖A − Aν‖2 = σν+1.


Lecture 4 - Fun with the SVD

OBJECTIVE: Applications of the SVD to image com-
pression, dimensionality reduction, visualization, informa-
tion retrieval and latent semantic analysis.


3 IMAGE COMPRESSION EXAMPLE

load clown.mat;
figure(1)
colormap('gray')
image(A);

[U,S,V] = svd(A);
figure(2)
k = 20;
colormap('gray')
image(U(:,1:k)*S(1:k,1:k)*V(:,1:k)');
The code loads a clown image into a 200 × 320 array A;
displays the image in one figure; performs a singular value
decomposition on A; and displays the image obtained from a
rank-20 SVD approximation of A in another figure. Results
are displayed in the figures below.

[Figures: the original and rank-20 reconstructed clown images, and the
individual rank-one terms u1σ1v1^T through u6σ6v6^T together with their
partial sums (1+2+3+4 and 1+2+3+4+5+6), are not reproduced here.]

The original storage requirements for A are 200 · 320 =
64, 000, whereas the compressed representation requires (200 +
320 + 1) · 20 ≈ 10, 000 storage locations.
Smaller eigenvectors capture high frequency variations (small
brush-strokes).
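
Assuming the variables A, U, S, V and k from the listing above are still
in the workspace, the 2-norm error of the rank-k approximation can be
checked against Theorem 5:

s  = svd(A);                          % all singular values of A
Ak = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';
[norm(A - Ak), s(k+1)]                % the two numbers should agree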
3 TEXT RETRIEVAL - LSI

  The SVD can be used to cluster documents and carry
out information retrieval by using concepts as opposed to
word-matching. This enables us to surmount the problems
of synonymy (car, auto) and polysemy (money bank, river
bank). The data is available in a term-frequency matrix.

If we truncate the approximation to the k largest singular
values, we have
        A ≈ Uk Σk Vk^T.
So
        Vk^T = Σk^{−1} Uk^T A.
  In English, A is projected to a lower-dimensional space
spanned by the k singular vectors Uk (eigenvectors of AA^T ).
To carry out retrieval, a query q ∈ R^n is first projected
to the low-dimensional space:

        qk = Σk^{−1} Uk^T q

and then we measure the angle between qk and the vk.
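
A minimal MATLAB sketch of this retrieval scheme on made-up data; it
assumes A is a term-by-document frequency matrix (terms in rows,
documents in columns) and q a query over the same terms.

A = [1 0 2; 0 1 1; 3 1 0; 0 2 1];     % toy data: 4 terms x 3 documents
q = [1; 0; 2; 0];                     % toy query over the same 4 terms

k = 2;
[U, S, V] = svd(A);
Uk = U(:,1:k);  Sk = S(1:k,1:k);  Vk = V(:,1:k);

qk   = Sk \ (Uk' * q);                % project the query: qk = Sk^{-1} Uk' q
docs = Vk';                           % column j is document j in concept space
% rank documents by the cosine of the angle between qk and each column
scores = (qk' * docs) ./ (norm(qk) * sqrt(sum(docs.^2, 1)))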
3 PRINCIPAL COMPONENT ANALYSIS (PCA)

  The columns of UΣ are called the principal compo-
nents of A. We can project high-dimensional data to these
components in order to be able to visualize it. This idea is
also useful for cleaning data as discussed in the previous text
retrieval example.

  For example, we can take several 16 × 16 images of the
digit 2 and project them to 2D. The images can be written
as vectors with 256 entries. We then form the matrix A ∈
R^{n×256}, carry out the SVD and truncate it to k = 2. Then
the components Uk Σk are 2 vectors with n data entries. We
can plot these 2D points on the screen to visualize the data.

[Figures: the 2D scatter plot of the projected digit images and the
corresponding digit images are not reproduced here.]
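
A sketch of this visualisation recipe, with random data standing in for
the digit images (no digit dataset is assumed to be available):

n = 100;
A = randn(n, 256);                    % n "images" as 256-dimensional rows
[U, S, V] = svd(A, 'econ');

k = 2;
Z = U(:,1:k) * S(1:k,1:k);            % principal components Uk*Sigma_k: n points in 2D
figure
plot(Z(:,1), Z(:,2), '.')
xlabel('first principal component')
ylabel('second principal component')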
Lecture 5 - Probability Revision

OBJECTIVE: Revise the fundamental concepts of prob-
ability, including marginalization, conditioning, Bayes rule
and expectation.


3 PROBABILITY

  Probability theory is the formal study of the laws of
chance. It is our tool for dealing with uncertainty. Notation:

  • Sample space: is the set Ω of all outcomes of an
    experiment.

  • Outcome: what we observed. We use ω ∈ Ω to
    denote a particular outcome. e.g. for a die we have
    Ω = {1, 2, 3, 4, 5, 6} and ω could be any of these six
    numbers.

  • Event: is a subset of Ω that is well defined (measur-
    able). e.g. the event A = {even} occurs if ω ∈ {2, 4, 6}.


Why do we need measure?

Frequentist Perspective

Let probability be the frequency of events.
Axiomatic Perspective

The frequentist interpretation has some shortcomings when
we ask ourselves questions like

  • What is the probability that David will sleep with Anne?

  • What is the probability that the Panama Canal is
    longer than the Suez Canal?

The axiomatic view is a more elegant mathematical solu-
tion. Here, a probabilistic model consists of the triple
(Ω, F, P ), where Ω is the sample space, F is the sigma-field
(collection of measurable events) and P is a function map-
ping F to the interval [0, 1]. That is, with each event A ∈ F
we associate a probability P (A).
  Some outcomes are not measurable, so we have to assign
probabilities to F and not Ω. Fortunately, in this course
everything will be measurable, so we need not concern our-
selves with measure theory. We do have to make sure the
following two axioms apply:

 1. P (∅) = 0 ≤ P (A) ≤ 1 = P (Ω)

 2. For disjoint sets An, n ≥ 1, we have

        P ( ∪_{n=1}^∞ An ) = Σ_{n=1}^∞ P (An)

If the sets overlap:

        P (A + B) = P (A) + P (B) − P (AB)
If the events A and B are independent, we have P (AB) =
P (A)P (B).

    Let P (HIV ) = 1/500 be the probability of contract-
  ing HIV by having unprotected sex. If one has unpro-
  tected sex twice, the probability of contracting HIV be-
  comes:

    What if we have unprotected sex 500 times?


Conditional Probability

        P (A|B) ≜ P (AB) / P (B)

where P (A|B) is the conditional probability of A given
that B occurs, P (B) is the marginal probability of B
and P (AB) is the joint probability of A and B. In
general, we obtain a chain rule

P (A1:n) = P (An|A1:n−1)P (An−1|A1:n−2) . . . P (A2|A1)P (A1)
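
A worked version of the calculation the example asks for, assuming
independent exposures:

        P (HIV after 2 exposures) = 1 − (1 − 1/500)^2 ≈ 0.004,

        P (HIV after 500 exposures) = 1 − (1 − 1/500)^500 ≈ 1 − e^{−1} ≈ 0.63.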
    Assume we have an urn with 3 red balls and 1 blue
  ball: U = {r, r, r, b}. What is the probability of drawing
  (without replacement) 2 red balls in the first 2 tries?


Marginalisation

Let the sets B1:n be disjoint and ∪_{i=1}^n Bi = Ω. Then

        P (A) = Σ_{i=1}^n P (A, Bi)

    Proof:

    What is the probability that the second ball drawn
  from our urn will be red?


Bayes Rule

Bayes rule allows us to reverse probabilities:

        P (A|B) = P (B|A)P (A) / P (B)
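
Worked answers to the two urn questions, using the chain rule and
marginalisation:

        P (r1, r2) = P (r1)P (r2|r1) = (3/4)(2/3) = 1/2,

        P (r2) = P (r2|r1)P (r1) + P (r2|b1)P (b1) = (2/3)(3/4) + (1)(1/4) = 3/4.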
Combining this with marginalisation, we obtain a powerful
tool for statistical modelling:

   P (model_i|data) = P (data|model_i)P (model_i) / Σ_{j=1}^M P (data|model_j )P (model_j )

That is, if we have prior probabilities for each model and
generative data models, we can compute how likely each
model is a posteriori (in light of our prior knowledge and
the evidence brought in by the data).


Discrete random variables

Let E be a discrete set, e.g. E = {0, 1}. A discrete
random variable (r.v.) is a map from Ω to E:

        X(ω) : Ω → E

such that for all x ∈ E we have {ω|X(ω) ≤ x} ∈ F. Since
F denotes the measurable sets, this condition simply says
that we can compute (measure) the probability P (X = x).

    Assume we are throwing a die and are interested in
  the events E = {even, odd}. Here Ω = {1, 2, 3, 4, 5, 6}.
  The r.v. takes the value X(ω) = even if ω ∈ {2, 4, 6}
  and X(ω) = odd if ω ∈ {1, 3, 5}. We describe this r.v.
  with a probability distribution p(xi) = P (X = xi) = 1/2,
  i = 1, 2.

The cumulative distribution function is defined as
F (x) = P (X ≤ x) and would for this example be:
Bernoulli Random Variables

Let E = {0, 1}, P (X = 1) = λ, and P (X = 0) = 1 − λ.
We now introduce the set indicator variable. (This is a very
useful notation.)

        IA(ω) =  1 if ω ∈ A;
                 0 otherwise.

Using this convention, the probability distribution of a Bernoulli
random variable reads:

        p(x) = λ^{I{1}(x)} (1 − λ)^{I{0}(x)}.


Expectation of Discrete Random Variables

The expectation of a discrete random variable X is

        E[X] = Σ_{xi∈E} xi p(xi)

The expectation operator is linear, so E(ax1 + bx2) = aE(x1) +
bE(x2). In general, the expectation of a function f (X) is

        E[f (X)] = Σ_{xi∈E} f (xi) p(xi)

Mean: µ ≜ E(X)
Variance: σ^2 ≜ E[(X − µ)^2]
    For the set indicator variable IA(ω),

        E[IA(ω)] = 1 · P (A) + 0 · (1 − P (A)) = P (A)


Continuous Random Variables

A continuous r.v. is a map to a continuous space, X(ω) :
Ω → R, under the usual measurability conditions. The cu-
mulative distribution function F (x) (cdf) is defined
by
        F (x) ≜ ∫_{−∞}^x p(y) dy = P (X ≤ x)

where p(x) denotes the probability density function
(pdf). For an infinitesimal measure dx in the real line, dis-
tributions F and densities p are related as follows:

        F (dx) = p(x)dx = P (X ∈ dx).
Univariate Gaussian Distribution

The pdf of a Gaussian distribution is given by

        p(x) = (2πσ^2)^{−1/2} e^{−(x−µ)^2/(2σ^2)}.

Our short notation for Gaussian variables is X ∼ N (µ, σ^2).


Univariate Uniform Distribution

A random variable X with a uniform distribution between
0 and 1 is written as X ∼ U[0,1](x).


Multivariate Distributions

Let f (u, v) be a pdf in 2-D. The cdf is defined by

        F (x, y) = ∫_{−∞}^x ∫_{−∞}^y f (u, v) du dv = P (X ≤ x, Y ≤ y).
1 Bivariate Uniform Distribution

        X ∼ U[0,1]^2 (x)


Multivariate Gaussian Distribution

Let x ∈ Rn. The pdf of an n-dimensional Gaussian is given
by
        p(x) = (2π)^{−n/2} |Σ|^{−1/2} e^{−(1/2)(x−µ)^T Σ^{−1}(x−µ)}

where
        µ = (µ1, . . . , µn)^T = (E(x1), . . . , E(xn))^T

and
        Σ = (σij) = E[(X − µ)(X − µ)^T ]

with σij = E[(Xi − µi)(Xj − µj )].
  We can interpret each component of x, for example, as
a feature of an image such as colour or texture. The term
(1/2)(x − µ)^T Σ^{−1}(x − µ) is called the Mahalanobis distance.
Conceptually, it measures the distance between x and µ.
    What is  ∫ · · · ∫ e^{−(1/2)(x−µ)^T Σ^{−1}(x−µ)} dx ?

    It is the normalising constant of the Gaussian: (2π)^{n/2} |Σ|^{1/2}.


Linear Operations

Let A ∈ Rk×n be a given matrix and b ∈ Rk a given vector, and let X ∈ Rn
be a random variable with mean E(X) = µx ∈ Rn and
covariance cov(X) = Σx ∈ Rn×n. We define a new random
variable
        Y = AX + b

If X ∼ N (µx, Σx), then Y ∼ N (µy , Σy ) where

        µy = E(Y ) = Aµx + b

        Σy = cov(Y ) = AΣxA^T
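
A small MATLAB check of this rule, with arbitrary illustrative values
for A, b, µx and Σx:

mu_x    = [1; 2];
Sigma_x = [2 0.5; 0.5 1];
A = [1 1; 0 2; 3 -1];              % k = 3, n = 2
b = [0; 1; -1];

n_samples = 1e5;
X = mu_x + chol(Sigma_x)' * randn(2, n_samples);   % draws from N(mu_x, Sigma_x)
Y = A * X + b;

[mean(Y, 2),  A * mu_x + b]        % empirical mean vs. A*mu_x + b
norm(cov(Y') - A * Sigma_x * A')   % empirical covariance vs. A*Sigma_x*A'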
Finally, we define the cross-covariance as

        ΣXY = E[(X − µX )(Y − µY )^T ].

X and Y are uncorrelated if ΣXY = 0. So,

        Σ =  ΣXX    0
              0    ΣY Y  .


Lecture 6 - Linear Supervised Learning

OBJECTIVE: Linear regression is a supervised learn-
ing task. It is of great interest because:

  • Many real processes can be approximated with linear
    models.

  • Linear regression appears as part of larger problems.

  • It can be solved analytically.

  • It illustrates many of the ideas in machine learning.
Given the data {x1:n, y1:n}, with xi ∈ Rd and yi ∈ R, we
want to fit a hyper-plane that maps x to y.

Mathematically, the linear model is expressed as follows:

        yi = θ0 + Σ_{j=1}^d xij θj

We let xi,0 = 1 to obtain

        yi = Σ_{j=0}^d xij θj

In matrix form, this expression is

        Y = Xθ

        y1        x10 · · · x1d     θ0
        ..   =     ..  . .   ..     ..
        yn        xn0 · · · xnd     θd

If we have several outputs yi ∈ Rc, our linear regression
expression becomes Y = XΘ, with Y ∈ Rn×c and Θ ∈ R(d+1)×c
(one column of parameters per output).

We will present several approaches for computing θ.
3 OPTIMIZATION APPROACH

Our aim is to minimise the quadratic cost between the
output labels and the model predictions

        C(θ) = (Y − Xθ)^T (Y − Xθ)

We will need the following result from matrix differentia-
tion: ∂(Aθ)/∂θ = A^T. Then

        ∂C/∂θ = −2X^T (Y − Xθ) = 0   ⇒   X^T Xθ = X^T Y

These are the normal equations. The solution (esti-
mate) is:
        θ̂ = (X^T X)^{−1} X^T Y

The corresponding predictions are

        Ŷ = HY = X(X^T X)^{−1} X^T Y

where H is the “hat” matrix.
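
A minimal MATLAB sketch of this solution on synthetic data (all values
are illustrative):

n = 50;  d = 2;
X = [ones(n,1), randn(n, d)];             % first column of ones carries theta_0
theta_true = [1; 2; -3];
Y = X * theta_true + 0.1 * randn(n, 1);   % noisy linear data

theta_hat = (X' * X) \ (X' * Y);          % normal equations
% theta_hat = X \ Y;                      % equivalent and numerically preferable
Y_hat = X * theta_hat;                    % predictions H*Y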
3 GEOMETRIC APPROACH

The prediction Ŷ = Xθ̂ is the orthogonal projection of Y onto
the column space of X, so the residual is orthogonal to the
columns of X:

        X^T (Y − Ŷ ) = 0,

which again gives the normal equations X^T Xθ̂ = X^T Y.


Maximum Likelihood
If our errors are Gaussian distributed, we can use the model

        Y = Xθ + N (0, σ^2 I)

Note that the mean of Y is Xθ and that its variance is
σ^2 I. So we can equivalently write this expression using the
probability density of Y given X, θ and σ:

        p(Y |X, θ, σ) = (2πσ^2)^{−n/2} e^{−(1/(2σ^2))(Y −Xθ)^T (Y −Xθ)}

The maximum likelihood (ML) estimate of θ is obtained by
taking the derivative of the log-likelihood, log p(Y |X, θ, σ).
The idea of maximum likelihood learning is to maximise the
likelihood of seeing some data Y by modifying the parame-
ters (θ, σ).

The ML estimate of θ is:

        θ̂_ML = (X^T X)^{−1} X^T Y,

i.e. the same as the least-squares estimate.
Proceeding in the same way, the ML estimate of σ is:

        σ̂^2_ML = (1/n) (Y − Xθ̂)^T (Y − Xθ̂)


Lecture 7 - Ridge Regression

OBJECTIVE: Here we learn a cost function for linear
supervised learning that is more stable than the one in the
previous lecture. We also introduce the very important no-
tion of regularization.
  All the answers so far are of the form

        θ̂ = (X^T X)^{−1} X^T Y

They require the inversion of X^T X. This can lead to prob-
lems if the system of equations is poorly conditioned. A
solution is to add a small element to the diagonal:

        θ̂ = (X^T X + δ^2 Id)^{−1} X^T Y

This is the ridge regression estimate. It is the solution to the
following regularised quadratic cost function

        C(θ) = (Y − Xθ)^T (Y − Xθ) + δ^2 θ^T θ
    Proof: Setting the derivative of C(θ) to zero,

        −2X^T (Y − Xθ) + 2δ^2 θ = 0   ⇒   (X^T X + δ^2 Id) θ = X^T Y,

which gives the ridge estimate above.

It is useful to visualise the quadratic optimisation function
and the constraint region. That is, we are solving the following
constrained optimisation problem:

        min_{θ : θ^T θ ≤ t}  (Y − Xθ)^T (Y − Xθ)

Large values of θ are penalised. We are shrinking θ towards
zero. This can be used to carry out feature weighting.
An input xi,d weighted by a small θd will have less
influence on the output yi.
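
Continuing the synthetic example from the least-squares sketch, the
ridge estimate is one extra line (δ is an illustrative value):

delta = 0.5;
d1 = size(X, 2);
theta_ridge = (X' * X + delta^2 * eye(d1)) \ (X' * Y);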
Spectral View of LS and Ridge Regression

Again, let X ∈ Rn×d be factored as

        X = UΣV^T = u1σ1v1^T + · · · + udσdvd^T,

where we have assumed that the rank of X is d.

    The least squares prediction is:

        Ŷ_LS = Σ_{i=1}^d ui ui^T Y

Likewise, for ridge regression we have:

        Ŷ_ridge = Σ_{i=1}^d  ( σi^2 / (σi^2 + δ^2) )  ui ui^T Y

The filter factor

        fi = σi^2 / (σi^2 + δ^2)

penalises small values of σi^2 (they go to zero at a faster rate).
Also, by increasing δ^2 we are penalising the weights:

[Figure not reproduced.]

Small eigenvectors tend to be wobbly. The Ridge filter fac-
tor fi gets rid of the wobbly eigenvectors. Therefore, the
predictions tend to be more stable (smooth, regularised).

The smoothness parameter δ^2 is often estimated by cross-
validation or Bayesian hierarchical methods.


Minimax and cross-validation

Cross-validation is a widely used technique for choosing δ.
Here’s an example:
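
A sketch of K-fold cross-validation for δ, reusing the synthetic X and Y
from the earlier sketches; the grid of values and K are illustrative
choices.

deltas = logspace(-3, 2, 20);
K = 5;
[n, d1] = size(X);
fold = mod(randperm(n), K) + 1;           % assign each point to one of K folds
cv_err = zeros(size(deltas));
for j = 1:numel(deltas)
    for k = 1:K
        tr = (fold ~= k);  te = ~tr;      % training and held-out indices
        th = (X(tr,:)' * X(tr,:) + deltas(j)^2 * eye(d1)) \ (X(tr,:)' * Y(tr));
        cv_err(j) = cv_err(j) + sum((Y(te) - X(te,:) * th).^2);
    end
end
[~, best] = min(cv_err);
best_delta = deltas(best)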
Lecture 8 - Maximum Likelihood and Bayesian Learning

OBJECTIVE: In this chapter, we revise maximum like-
lihood (ML) for a simple binary model. We then introduce
Bayesian learning for this simple model and for the linear-
Gaussian regression setting of the previous chapters. The key
difference between the two approaches is that the frequentist
view assumes there is one true model responsible for the ob-
servations, while the Bayesian view assumes that the model
is a random variable with a certain prior distribution. Com-
putationally, the ML problem is one of optimization, while
Bayesian learning is one of integration.


3 MAXIMUM LIKELIHOOD

  Frequentist Learning assumes that there is a true model
(say a parametric model with parameters θ0). The estimate
is denoted θ̂. It can be found by maximising the likelihood:

        θ̂ = arg max_θ p(x1:n|θ)

For identically and independently distributed
(i.i.d.) data:

        p(x1:n|θ) = Π_{i=1}^n p(xi|θ)

        L(θ) = log p(x1:n|θ) = Σ_{i=1}^n log p(xi|θ)

Let’s illustrate this with a coin-tossing example.
    Let x1:n, with xi ∈ {0, 1}, be i.i.d. Bernoulli:

        p(x1:n|θ) = Π_{i=1}^n p(xi|θ) = θ^m (1 − θ)^{n−m}

  With m ≜ Σ_i xi, we have

        L(θ) = m log θ + (n − m) log(1 − θ)

  Differentiating, we get

        ∂L/∂θ = m/θ − (n − m)/(1 − θ) = 0   ⇒   θ̂_ML = m/n


3 BAYESIAN LEARNING

Given our prior knowledge p(θ) and the data model p(·|θ),
the Bayesian approach allows us to update our prior using
the new data x1:n as follows:

        p(θ|x1:n) = p(x1:n|θ) p(θ) / p(x1:n)

where p(θ|x1:n) is the posterior distribution, p(x1:n|θ)
is the likelihood and p(x1:n) is the marginal likelihood
(evidence). Note

        p(x1:n) = ∫ p(x1:n|θ) p(θ) dθ
Bayesian Prediction

We predict by marginalising over the posterior of the para-
meters

        p(xn+1|x1:n) = ∫ p(xn+1, θ|x1:n) dθ
                     = ∫ p(xn+1|θ) p(θ|x1:n) dθ


Bayesian Model Selection

For a particular model structure Mi, we have

        p(θ|x1:n, Mi) = p(x1:n|θ, Mi) p(θ|Mi) / p(x1:n|Mi)

Models are selected according to their posterior:

P (Mi|x1:n) ∝ P (x1:n|Mi)p(Mi) = P (Mi) ∫ p(x1:n|θ, Mi)p(θ|Mi) dθ

The ratio P (x1:n|Mi)/P (x1:n|Mj ) is known as the Bayes
Factor.


    Let x1:n, with xi ∈ {0, 1}, be i.i.d. Bernoulli: xi ∼ B(1, θ):

        p(x1:n|θ) = Π_{i=1}^n p(xi|θ) = θ^m (1 − θ)^{n−m}

  Let us choose the following Beta prior distribution:

        p(θ) = [ Γ(α + β) / (Γ(α)Γ(β)) ] θ^{α−1} (1 − θ)^{β−1}

  where Γ denotes the Gamma function. For the time
  being, α and β are fixed hyper-parameters. The
  posterior distribution is proportional to:

        p(θ|x1:n) ∝ θ^{m+α−1} (1 − θ)^{n−m+β−1}

  with normalisation constant Γ(n + α + β) / ( Γ(m + α) Γ(n − m + β) ).
Since the posterior is also Beta, we say that the Beta prior
is conjugate with respect to the binomial likelihood. Con-
jugate priors lead to the same form of posterior.

Different hyper-parameters of the Beta Be(α, β) distribution
give rise to different prior specifications:

[Figure: Beta densities for different (α, β), not reproduced here.]

The generalisation of the Beta distribution is the Dirichlet
distribution D(αi), with density

        p(θ) ∝ Π_{i=1}^k θi^{αi−1}

where we have assumed k possible thetas. Note that the
Dirichlet distribution is conjugate with respect
to a Multinomial likelihood.
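
A small MATLAB sketch of this prior-to-posterior update, using the Beta
density formula above; the hyper-parameters and data counts are
illustrative.

alpha = 2;  beta_ = 2;            % hyper-parameters (beta_ avoids shadowing beta())
n = 20;  m = 15;                  % pretend we observed 15 heads in 20 tosses

t = linspace(0.001, 0.999, 500);
betapdf_ = @(t, a, b) gamma(a + b) ./ (gamma(a) .* gamma(b)) .* t.^(a-1) .* (1 - t).^(b-1);

prior     = betapdf_(t, alpha, beta_);
posterior = betapdf_(t, m + alpha, n - m + beta_);

plot(t, prior, '--', t, posterior, '-')
legend('prior', 'posterior')
xlabel('\theta')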
3 BAYESIAN LEARNING FOR LINEAR-GAUSSIAN MODELS

In the Bayesian linear prediction setting, we focus on com-
puting the posterior:

  p(θ|X, Y ) ∝ p(Y |X, θ) p(θ)
             = (2πσ^2)^{−n/2} e^{−(1/(2σ^2))(Y −Xθ)^T (Y −Xθ)} p(θ)

We often want to maximise the posterior; that is, we look
for the maximum a posteriori (MAP) estimate. In this case,
the choice of prior determines a type of constraint! For ex-
ample, consider a Gaussian prior θ ∼ N (0, δ^2 σ^2 Id). Then

  p(θ|X, Y ) ∝ (2πσ^2)^{−n/2} e^{−(1/(2σ^2))(Y −Xθ)^T (Y −Xθ)}
               × (2πσ^2 δ^2)^{−d/2} e^{−(1/(2δ^2 σ^2)) θ^T θ}

Rearranging, one finds

  p(θ|X, Y ) = |2πσ^2 M |^{−1/2} e^{−(1/(2σ^2))(θ−µ)^T M^{−1}(θ−µ)}
             ∝ (2πσ^2)^{−n/2} e^{−(1/(2σ^2))(Y −Xθ)^T (Y −Xθ)} (2πσ^2 δ^2)^{−d/2} e^{−(1/(2δ^2 σ^2)) θ^T θ}

Our task is to rearrange terms in the exponents in order to
obtain a simple expression for the posterior distribution.
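
A sketch of the rearrangement (completing the square in θ):

        (Y − Xθ)^T (Y − Xθ) + δ^{−2} θ^T θ
            = θ^T (X^T X + δ^{−2} Id) θ − 2θ^T X^T Y + Y^T Y
            = (θ − µ)^T M^{−1} (θ − µ) + Y^T P Y,

with M^{−1} = X^T X + δ^{−2} Id, µ = M X^T Y and P = In − XM X^T,
which matches the sufficient statistics and the matrix P used below.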
So the posterior for θ is Gaussian:

  p(θ|X, Y ) = |2πσ^2 M |^{−1/2} e^{−(1/(2σ^2))(θ−µ)^T M^{−1}(θ−µ)}

with sufficient statistics:

        E(θ|X, Y ) = (X^T X + δ^{−2} Id)^{−1} X^T Y
      var(θ|X, Y ) = (X^T X + δ^{−2} Id)^{−1} σ^2

The MAP point estimate is:

        θ_MAP = (X^T X + δ^{−2} Id)^{−1} X^T Y

It is the same as the ridge estimate (except for a trivial
negative sign in the exponent of δ), which results from the
L2 constraint. A flat (“vague”) prior with large variance
(large δ) leads to the ML estimate:

        θ_MAP = θ_ridge  −(δ^2→0)→  θ_ML = θ_SVD = θ_LS


2 Full Bayesian Model

In Bayesian inference, we’re interested in the full posterior:

  p(θ, σ^2, δ^2|X, Y ) ∝ p(Y |θ, σ^2, X) p(θ|σ^2, δ^2) p(σ^2) p(δ^2)

where

        Y |θ, σ^2, X ∼ N (Xθ, σ^2 In)
                   θ ∼ N (0, σ^2 δ^2 Id)
                 σ^2 ∼ IG (a/2, b/2)
                 δ^2 ∼ IG (α, β)

where IG (α, β) denotes the Inverse-Gamma distribu-
tion:

        δ^2 ∼ IG (α, β):   p(δ^2) = (β^α / Γ(α)) e^{−β/δ^2} (δ^2)^{−α−1} I[0,∞)(δ^2)

    This is the conjugate prior for the variance of a
Gaussian. The generalization of this distribution,
i.e. the conjugate prior of a covariance matrix, is the inverse
Wishart distribution Σ ∼ IWd(α, αΣ∗), admitting the
density

        p(Σ|α, Σ∗) ∝ |Σ|^{−(α+d+1)/2} exp{−(1/2) tr(αΣ∗Σ^{−1})}

  We can visualise our hierarchical model with the following
graphical model:

[Figure: the graphical model is not reproduced here.]


The product of likelihood and priors is:

  p(θ, σ^2, δ^2|X, Y ) ∝ (2πσ^2)^{−n/2} e^{−(1/(2σ^2))(Y −Xθ)^T (Y −Xθ)}
                         × (2πσ^2 δ^2)^{−d/2} e^{−(1/(2δ^2 σ^2)) θ^T θ}
                         × (σ^2)^{−a/2−1} e^{−b/(2σ^2)} (δ^2)^{−α−1} e^{−β/δ^2}

We know from our previous work on computing the posterior
for θ that:

  p(θ, σ^2, δ^2|X, Y ) ∝ (2πσ^2)^{−n/2} e^{−(1/(2σ^2)) Y^T P Y}
                         × (2πσ^2 δ^2)^{−d/2} e^{−(1/(2σ^2))(θ−µ)^T M^{−1}(θ−µ)}
                         × (σ^2)^{−a/2−1} e^{−b/(2σ^2)} (δ^2)^{−α−1} e^{−β/δ^2}

where

        M^{−1} = X^T X + δ^{−2} Id
            µ = M X^T Y
            P = In − XM X^T
From this expression, it is now obvious that

        p(θ|σ^2, δ^2, X, Y ) = N (µ, σ^2 M )

Next, we integrate p(θ, σ^2, δ^2|X, Y ) over θ in order to get
an expression for p(σ^2, δ^2|X, Y ). This will allow us to get
an expression for the marginal posterior p(σ^2|X, Y ):

        p(σ^2|X, Y ) ∼ IG ( (a + n)/2 , (b + Y^T P Y )/2 )

Integrating over σ^2 gives us an expression for p(δ^2|X, Y ).
Unfortunately this is a nonstandard distribution, thus mak-
ing it hard for us to come up with the normalizing constant.
So we’ll make use of the fact that we know θ and σ^2 to derive
a conditional distribution p(δ^2|θ, σ^2, X, Y ).

    We know that:

  p(θ, σ^2, δ^2|X, Y ) ∝ (2πσ^2)^{−n/2} e^{−(1/(2σ^2))(Y −Xθ)^T (Y −Xθ)}
                         × (2πσ^2 δ^2)^{−d/2} e^{−(1/(2δ^2 σ^2)) θ^T θ}
                         × (σ^2)^{−a/2−1} e^{−b/(2σ^2)} (δ^2)^{−α−1} e^{−β/δ^2}

Collecting the terms that depend on δ^2 gives

        p(δ^2|θ, σ^2, X, Y ) = IG ( d/2 + α , β + θ^T θ/(2σ^2) )


In summary, we can

  • Obtain p(θ|σ^2, δ^2, X, Y ) analytically.

  • Obtain p(σ^2|δ^2, X, Y ) analytically.

  • Derive an expression for p(δ^2|θ, σ^2, X, Y ).

Given δ^2, we can obtain analytical expressions for θ and
σ^2. But, let’s be a bit more ambitious. Imagine we could
run the following sampling algorithm (known as the Gibbs
Sampler):


 1. LOAD data (X, Y).

 2. Compute X^T Y and X^T X.

 3. Set, e.g., a = b = 0, α = 2 and β = 10.

 4. Sample δ^{2(0)} ∼ IG(α, β).

 5. FOR i = 1 to N:

    (a) Compute M, P and µ^{(i)} using δ^{2(i-1)}.

    (b) Sample σ^{2(i)} ∼ IG( (a + n)/2 , (b + Y^T P Y)/2 ).

    (c) Sample θ^{(i)} ∼ N( µ^{(i)} , σ^{2(i)} M ).

    (d) Sample δ^{2(i)} ∼ IG( d/2 + α , β + θ^{(i)T} θ^{(i)} / (2σ^{2(i)}) ).
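To make these steps concrete, here is a minimal MATLAB sketch of the Gibbs sampler
(not part of the lecture notes). It assumes a small synthetic data set in place of
the loaded (X, Y), uses the hyper-parameter settings of step 3, and relies on
gamrnd from the Statistics Toolbox, sampling IG(α, β) as the reciprocal of a
Gamma draw with shape α and rate β:

    % Minimal Gibbs sampler sketch for the Bayesian linear-Gaussian model.
    % Synthetic data stands in for step 1; the weights below are only
    % illustrative assumptions.
    n = 50; d = 3;
    X = [ones(n,1) randn(n,d-1)];          % design matrix (first column is the bias)
    Y = X*[1; 2; -1] + 0.5*randn(n,1);     % synthetic targets

    XtX = X'*X;  XtY = X'*Y;               % step 2
    a = 0; b = 0; alpha = 2; beta = 10;    % step 3
    N = 1000;                              % number of Gibbs iterations

    theta = zeros(d,N); sigma2 = zeros(1,N); delta2 = zeros(1,N);
    d2 = 1/gamrnd(alpha, 1/beta);          % step 4: delta^2(0) ~ IG(alpha, beta)

    for i = 1:N
      M  = inv(XtX + (1/d2)*eye(d));       % step (a): M, mu and P from delta^2(i-1)
      mu = M*XtY;
      P  = eye(n) - X*M*X';
      s2 = 1/gamrnd((a+n)/2, 2/(b + Y'*P*Y));                 % step (b)
      th = mu + chol(s2*M, 'lower')*randn(d,1);               % step (c)
      d2 = 1/gamrnd(d/2 + alpha, 1/(beta + th'*th/(2*s2)));   % step (d)
      theta(:,i) = th;  sigma2(i) = s2;  delta2(i) = d2;
    end

The only subtlety is the inverse-Gamma parameterisation: with density proportional
to x^{-α-1} exp{-β/x}, a draw is obtained as 1/gamrnd(α, 1/β), since gamrnd takes a
shape and a scale parameter.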
We can use these samples in order to approximate the integrals of interest with
Monte Carlo averages. For example, the predictive distribution

p(y_{n+1} | X_{1:n+1}, Y) = ∫ p(y_{n+1} | θ, σ^2, x_{n+1}) p(θ, σ^2, δ^2 | X, Y) dθ dσ^2 dδ^2

can be approximated with:

p(y_{n+1} | X_{1:n+1}, Y) ≈ (1/N) Σ_{i=1}^{N} p(y_{n+1} | θ^{(i)}, σ^{2(i)}, x_{n+1})

That is, since p(y_{n+1} | θ, σ^2, x_{n+1}) = N(x_{n+1}θ, σ^2), the approximation is
a mixture of N Gaussians:

p(y_{n+1} | X_{1:n+1}, Y) ≈ (1/N) Σ_{i=1}^{N} N( y_{n+1} ; x_{n+1}θ^{(i)} , σ^{2(i)} )

In the next lecture, we will derive the theory that justifies the use of this
algorithm, as well as many other Monte Carlo algorithms.
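As a companion to the predictive approximation above, here is a short hedged sketch
that reuses the samples theta(:,i) and sigma2(i) from the Gibbs code earlier; the
test input xnew and the grid of candidate y values are arbitrary assumptions:

    % Approximate predictive density at an assumed test input.
    xnew  = [1 0.5 -0.2];                  % hypothetical 1-by-d test input (bias first)
    ygrid = linspace(-6, 6, 200);          % grid of candidate y_{n+1} values
    pred  = zeros(size(ygrid));
    for i = 1:N
      m = xnew*theta(:,i);                 % mean of N(x_{n+1} theta^(i), sigma^2(i))
      pred = pred + exp(-(ygrid - m).^2 ./ (2*sigma2(i))) / sqrt(2*pi*sigma2(i));
    end
    pred = pred/N;                         % Monte Carlo average: a mixture of N Gaussians
    plot(ygrid, pred)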


  • 41. CPSC-540: Machine Learning 81 CPSC-540: Machine Learning 82 For example, the predictive distribution 1. LOAD data (X, Y ). p(yn+1|X1:n+1, Y ) = p(yn+1|θ, σ 2, xn+1)p(θ, σ 2, δ 2|X, Y )dθdσ 2dδ 2 2. Compute X T Y and X T X. can be approximated with: 3. Set, e.g., a = b = 0, α = 2 and β = 10. N 1 p(yn+1|X1:n+1, Y ) = p(yn+1|θ(i), σ 2(i), xn+1) 4. Sample δ 2(0) ∼ IG (α, β). N i=1 5. FOR i = 1 to N : That is, (a) Compute M , P and µ(i) using δ 2(i−1). p(yn+1|X1:n+1, Y ) = (b) Sample σ 2(i) ∼ IG a+n b+Y P Y 2 , 2 . (c) Sample θ(i) ∼ N (µ(i), σ 2(i)M ). (i)T (i) (d) Sample δ 2(i) ∼ IG d 2 + α, β + θ 2σ2(i) θ . We can use these samples in order to approximate the inte- In the next lecture, we will derive the theory that justifies grals of interest with Monte Carlo averages. the use of this algorithm as well a many other Monte Carlo algorithms.