Deep Nets, Bayes and the story of AI (continued)
David Barber
Table of Contents
History of the AI dream
How do brains work?
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Intelligent Machinery
1948 Turing and Champernowne ‘paper and pencil’ chess.
Intelligent Machinery
1951 Prinz mate-in-two moves chess machine.
1952 Strachey programs first computer draughts algorithm.
Learning Machines
1951 Oettinger makes first program that ‘learns’.
1955 Samuel adds ‘learning’ to his draughts algorithm.
Logical Intelligence
1968 Risch’s algorithm for integration in calculus
1972 Prolog for general logical reasoning
1997 Deep Blue defeats Kasparov
Other forms of intelligence
But is this getting us to where we’d like to be?
Selfridge-Shannon film clip
Speech Recognition
Visual Processing
Natural Language modelling
Planning and decision in uncertain environments
Perhaps a different approach would be useful.
Table of Contents
History of the AI dream
How do brains work?
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Astonishing Hypothesis: Crick
“A person’s mental activities are entirely due to the behaviour of nerve
cells and the molecules that make them up and influence them.”
Neurons
Visual Pathway
Information Processing in Brains
Figure: information flows from the real world through layers of feature-detecting neurons to high-level concepts.
Hierarchical; Modular; Binary; Parallel; Noisy
Table of Contents
History of the AI dream
How do brains work?
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Artificial Neuron (Perceptron)
Figure: seven input neurons feed a single output neuron through weights (weight 1, . . . , weight 7).
Training an artificial neural network
Want to generalise to new images with high accuracy.
Artificial Network
1957 Rosenblatt’s perceptron
perceptron film clip
Connectionism
1960 Realised a perceptron can only solve simple tasks.
1970 Decline in interest.
1980 New computing power made training multilayer networks feasible.
Figure: a feed-forward network mapping inputs to an output.
Each node (or 'neuron') computes a function of a weighted combination of its parental nodes: hj = σ( Σ_i wij hi )
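A minimal numerical sketch of this computation for a single output neuron, with made-up weights and inputs (the logistic function is one common choice for σ):

```python
import numpy as np

def sigma(a):
    # logistic activation, one common choice for the nonlinearity
    return 1.0 / (1.0 + np.exp(-a))

# seven input 'neurons' and one output neuron; values and weights are illustrative
h_in = np.array([0.2, 0.9, 0.1, 0.0, 0.5, 0.7, 0.3])
w    = np.array([0.5, -1.2, 0.8, 0.1, 0.0, 2.0, -0.3])

h_out = sigma(w @ h_in)    # h_j = sigma(sum_i w_ij h_i)
print(h_out)
```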
Neural Networks and Deep Learning
Historical Problems with Neural Nets (1990s)
NNs are difficult to train (many local optima?).
Particularly difficult to train a NN with a large number of layers (say larger
than around 10).
‘Gradient Diffusion Problem’ – difficult to assign responsibility of errors to
individual ‘neurons’.
Machine Learning (up to 2006)
A large section of the machine learning community abandoned NNs.
More principled and computationally better understood techniques (SVMs
and related convex methods) replaced them.
Bayesian AI (1990s onwards)
From mid 1990s there was a realisation that pattern recognition is not
sufficient for all AI purposes.
Uncertainty and reasoning are not naturally representable using standard
feed-forward nets.
Explosion in more ‘symbolic’ Bayesian AI.
Deep Learning
NNs have resurged in interest in the last few years (Hinton, Bengio, . . . )
Also called ‘deep learning’.
Sense that very complex tasks (object recognition, learning complex structure
in data) require going beyond simple (convex) statistical techniques.
The brain uses hierarchical distributed processing and it is likely to be for a
good reason.
Many problems have a hierarchical structure: images are made of parts;
language is hierarchical, etc.
Why now?
New computing resources (GPU processing)
Availability of large amounts of data means that we can train nets with many
parameters (10^10).
Recent evidence suggests local optima are not particularly problematic.
Autoencoder
Figure: an autoencoder maps the inputs y1, . . . , y5 through hidden layers with a bottleneck (h1, h2, h3) back to a reconstruction of y1, . . . , y5.
The bottleneck forces the network to try to find a low dimensional
representation of the data.
Useful for unsupervised learning.
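As a rough sketch of the bottleneck idea (untrained, randomly initialised weights; the layer sizes and tanh nonlinearity are my own choices, not those of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 5, 3                                  # 5 visible units, 3-unit bottleneck

W_enc = rng.normal(scale=0.1, size=(H, D))   # encoder weights (random, untrained)
W_dec = rng.normal(scale=0.1, size=(D, H))   # decoder weights

def encode(y):
    return np.tanh(W_enc @ y)                # low dimensional code h

def decode(h):
    return W_dec @ h                         # reconstruction of y

y = rng.normal(size=D)                       # a data point
y_hat = decode(encode(y))
print('reconstruction error:', np.mean((y - y_hat) ** 2))
```

Training an autoencoder amounts to adjusting W_enc and W_dec to make this reconstruction error small over the dataset.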
Autoencoder on MNIST digits (Hinton 2006 Science)
Figure : Reconstructions using H = 30 components. From the Top: Original image,
Autoencoder1, Autoencoder2, PCA
60,000 training images (28 × 28 = 784 pixels).
Use a form of autoencoder to find a lower (30) dimensional representation.
At the time, the special layerwise training procedure was considered
fundamental to the success of this approach. Now not deemed necessary,
provided we use a sensible initialisation.
Google Cats
10 Million Youtube video frames (200x200 pixel images).
Use a specialised autoencoder with 9 layers (1 billion weights).
2000 computers + two weeks of computing.
Examine units to see what images they most respond to.
Google Autoencoder
From Nando De Freitas
Convolutional NNs
CNNs are particularly popular in image processing
Often the feature maps correspond not to macro features (such as bicycles) but
to micro features.
For example, in handwritten digit recognition they correspond to small
constituent parts of the digits.
These are used then to process the image into a representation that is better
for recognition.
NNs in NLP
Bag of Words
We have D words in a dictionary, aardvark,. . .,zorro so that we can relate
each word with its dictionary index.
We can also think of this as a Euclidean embedding e:
aardvark → eaardvark = (1, 0, . . . , 0)^T, zorro → ezorro = (0, 0, . . . , 1)^T
Word Embeddings
Idea is to replace the Euclidean embeddings e with embeddings (vectors) v
that are learned.
Objective is, for example, next word prediction accuracy.
These are often called ‘neural language models’.
NNs in NLP
Each word w in the dictionary has an associated embedding vector vw.
Usually around 200 dimensional vectors are used.
Consider the sentence:
the cat sat on the mat
and that we wish to predict the word on given the two preceding words cat sat
and the two succeeding words the mat
We can use a network that has inputs vcat, vsat, vthe, vmat
The output of the network is a probability over all words in the dictionary
p(w| {vinputs}).
We want p(w = on|vcat, vsat, vthe, vmat) to be high.
The overall objective is then to learn all the word embeddings and network
parameters subject to predicting the word correctly based on the context.
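A hedged sketch of the forward pass of such a model, with a toy vocabulary, a small embedding dimension and a single linear scoring layer (all illustrative simplifications; real neural language models are larger and nonlinear):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ['the', 'cat', 'sat', 'on', 'mat']
idx = {w: i for i, w in enumerate(vocab)}

d = 8                                                # embedding dimension (around 200 in practice)
V = rng.normal(scale=0.1, size=(len(vocab), d))      # one learnable embedding vector per word
W = rng.normal(scale=0.1, size=(len(vocab), 4 * d))  # maps the 4 context embeddings to word scores

def p_word_given_context(context):
    x = np.concatenate([V[idx[w]] for w in context])  # concatenate context embeddings
    scores = W @ x
    p = np.exp(scores - scores.max())
    return p / p.sum()                                # softmax: p(w | context)

p = p_word_given_context(['cat', 'sat', 'the', 'mat'])
print('p(on | context) =', p[idx['on']])
```

With untrained weights this probability is near uniform; learning adjusts V and W so that the correct word gets high probability.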
Word Embeddings
Given a word (France, for example) we can find which words w have embedding
vectors closest to vFrance. From Ronan Collobert (2011).
Word Embeddings
There appears to be a natural ‘geometry’ to the embeddings. For example, there
are directions that correspond to gender.
vwoman − vman ≈ vaunt − vuncle
vwoman − vman ≈ vqueen − vking
From Mikolov (2013).
Word Embeddings: Analogies
Given a relationship, France-Paris, we get the ‘relationship’ embedding
v = vParis − vFrance
Given Italy we can calculate vItaly + v and find the word in the dictionary which
has closest embedding to this (it turns out to be Rome!). From Mikolov (2013).
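A small sketch of this analogy computation by nearest (cosine) neighbour; the 2-d vectors are hand-made purely so the example runs, they are not learned embeddings:

```python
import numpy as np

def nearest(word_vecs, query, exclude=()):
    # word whose embedding has the highest cosine similarity to `query`
    best, best_sim = None, -np.inf
    for w, v in word_vecs.items():
        if w in exclude:
            continue
        sim = v @ query / (np.linalg.norm(v) * np.linalg.norm(query) + 1e-12)
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# toy 2-d 'embeddings', chosen by hand for illustration only
vecs = {'Paris': np.array([1.0, 1.0]), 'France': np.array([1.0, 0.0]),
        'Rome':  np.array([2.0, 1.1]), 'Italy':  np.array([2.0, 0.0])}

v = vecs['Paris'] - vecs['France']                 # the 'capital-of' direction
print(nearest(vecs, vecs['Italy'] + v, exclude={'Italy', 'Paris', 'France'}))
```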
Word Embeddings: Constrained Embeddings
We can learn embeddings for English words and embeddings for Chinese
words.
However, when we know that a Chinese and English word have a similar
meaning, we add a constraint that the word embeddings vChineseWord and
vEnglishWord should be close.
We have only a small amount of labelled ‘similar’ Chinese-English words
(these are the green border boxes in the above; they are standard translations
of the corresponding Chinese character).
We can visualise in 2D (using t-SNE) the embedding vectors. See Socher
(2013)
Word Embeddings: Constrained Embeddings
Recursive Nets and Embeddings
Stanford Sentiment Treebank. Consists of parsed sentences with sentiment labels
(−−, −, 0, +, ++) for each node (phrase) in the tree. 215,000 labelled phrases
(obtained from three humans).
Recursive Nets and Embeddings
Idea is to recursively combine embeddings such that they accurately predict
the sentiment at each node.
Recursive Nets and Embeddings
Training
We have a softmax classifier for each node in the tree, to predict the
sentiment of the phrase beneath this node in the tree.
The weights of this classifier are shared across all nodes.
At the leaf nodes at the bottom of the tree, the inputs to the classifiers are
the word embeddings.
The embeddings are combined by another network g with common
parameters, which forms the input to the sentiment classifier.
We then learn all the embeddings, shared classifier parameters and shared
combination parameters to maximise the classification accuracy.
Prediction
For a new movie review, the review is first parsed using a standard grammar
tree parser.
This forms the tree which can be used to recursively form the sentiment class
label for the review.
Currently the best sentiment classifier. Socher (2013)
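A minimal sketch of the two shared components (the combination network g and the softmax sentiment classifier); the dimensions, tanh combination and random weights are assumptions for illustration, not the RNTN used by Socher (2013):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_classes = 4, 5                                 # embedding dim; classes --, -, 0, +, ++

W = rng.normal(scale=0.1, size=(d, 2 * d))          # shared combination network g
U = rng.normal(scale=0.1, size=(n_classes, d))      # shared softmax sentiment classifier

def combine(h_left, h_right):
    # parent phrase representation from its two children (shared across the tree)
    return np.tanh(W @ np.concatenate([h_left, h_right]))

def sentiment(h):
    s = U @ h
    p = np.exp(s - s.max())
    return p / p.sum()                              # class probabilities at this node

# leaves are word embeddings; random stand-ins here
h_not, h_dull = rng.normal(size=d), rng.normal(size=d)
print(sentiment(combine(h_not, h_dull)))            # sentiment of the combined phrase
```

Training backpropagates the classification loss at every node through U, W and the word embeddings themselves.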
Recursive Nets and Embeddings
Figure 9 (from Socher 2013): RNTN prediction of positive and negative (bottom right) sentences and their negation, showing node-level sentiment labels on the parse trees of 'Roger Dodger is one of the most compelling variations on this theme.' / 'Roger Dodger is one of the least compelling variations on this theme.', 'I liked every single minute of this.' / 'I did n't like a single minute of this.', and 'It 's just incredibly dull.' / 'It 's not dull.'
Recurrent Nets
Figure: an RNN unrolled through time, with inputs x1, x2, x3, hidden states h1, h2, h3, outputs y1, y2, y3, and shared weight matrices A, B, C (B on the hidden-to-hidden connections).
RNNs are used in timeseries applications
The basic idea is that the hidden units ht at time t (and possibly the output yt)
depend on the previous state of the network ht−1, xt−1, yt−1, for inputs xt and
outputs yt.
In the above network, I ‘unrolled the net through time’ to give a standard NN
diagram.
I omitted the potential links from xt−1, yt−1 to ht.
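A minimal sketch of one such unrolled computation; which of A and C sits on the input and output connections in the diagram is my assumption, and all weights are random:

```python
import numpy as np

rng = np.random.default_rng(0)
dx, dh, dy = 3, 5, 2                        # input, hidden and output sizes (illustrative)
A = rng.normal(scale=0.1, size=(dh, dx))    # input -> hidden (assumed)
B = rng.normal(scale=0.1, size=(dh, dh))    # hidden -> hidden recurrence
C = rng.normal(scale=0.1, size=(dy, dh))    # hidden -> output (assumed)

def rnn(xs):
    h = np.zeros(dh)
    ys = []
    for x in xs:                            # unroll the net through time
        h = np.tanh(A @ x + B @ h)          # h_t depends on x_t and h_{t-1}
        ys.append(C @ h)                    # y_t depends on h_t
    return ys

xs = [rng.normal(size=dx) for _ in range(3)]
print(rnn(xs))
```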
Handwriting Generation using a RNN
Some training examples.
Handwriting Generation using a RNN
Some generated examples. Top line is real handwriting, for comparison. See Alex
Graves' work.
Table of Contents
History of the AI dream
How do brains work?
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reasons research in deep learning has exploded
Much greater compute power. (GPU)
Much larger datasets.
AutoDiff.
What is AutoDiff?
AutoDiff takes a function f(x) and returns an exact value (up to machine
accuracy) for the gradient
gi(x) ≡ ∂f(x)/∂xi
Note that this is not the same as a numerical approximation (such as central
differences) for the gradient.
One can show that, if done efficiently, one can always calculate the gradient in
less than 5 times the time it takes to compute f(x).
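To illustrate the distinction, here is a hand-derived exact gradient of a small function compared against a central-difference approximation (a sketch; an autodiff tool would produce the exact value automatically rather than by hand):

```python
import numpy as np

def f(x):
    return np.cos(np.sin(x[0] * x[1]))

def grad_f(x):
    # exact gradient, derived by hand (what autodiff would return)
    u = x[0] * x[1]
    return -np.sin(np.sin(u)) * np.cos(u) * np.array([x[1], x[0]])

def central_diff(f, x, eps=1e-5):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)   # only accurate to O(eps^2)
    return g

x = np.array([0.7, 1.3])
print(grad_f(x))
print(central_diff(f, x))
```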
Reverse Differentiation
A useful graphical representation is that the total derivative of f with respect to x
is given by the sum over all path values from x to f, where each path value is the
product of the partial derivatives of the functions on the edges:
df/dx = ∂f/∂x + (∂f/∂g)(dg/dx)
Figure: x feeds f both directly (edge ∂f/∂x) and via g (edges dg/dx and ∂f/∂g).
Example
For f(x) = x^2 + xgh, where g = x^2 and h = xg^2:
Figure: the graph with edge values 2x + gh (x → f), 2x (x → g), g^2 (x → h), 2gx (g → h), xh (g → f) and xg (h → f).
f′(x) = (2x + gh) + (g^2 · xg) + (2x · 2gx · xg) + (2x · xh) = 2x + 8x^7
Reverse Differentiation
Consider
f(x1, x2) = cos (sin(x1x2))
We can represent this computationally using an Abstract Syntax Tree (AST):
Figure: the AST with leaves x1, x2 feeding f1, which feeds f2, which feeds f3.
f1(x1, x2) = x1x2
f2(x) = sin(x)
f3(x) = cos(x)
Given values for x1, x2, we first run forwards through the tree so that we can
associate each node with an actual function value.
Reverse Differentiation
Figure: the same AST (x1, x2 → f1 → f2 → f3).
df3/dx1 = (∂f3/∂f2)(df2/dx1) = (∂f3/∂f2)(df2/df1)(df1/dx1), where the first two factors form df3/df1.
Similarly,
df3/dx2 = (∂f3/∂f2)(df2/df1)(df1/dx2) = (df3/df1)(df1/dx2)
The two derivatives share the same computation branch and
we want to exploit this.
Reverse Differentiation
Figure: the same AST, with edge derivatives
∂f1/∂x1 = x2, ∂f1/∂x2 = x1, ∂f2/∂f1 = cos(f1), ∂f3/∂f2 = − sin(f2)
1. Find the reverse ancestral (backwards) schedule of nodes (f3, f2, f1, x1, x2).
2. Start with the first node n1 in the reverse schedule and define tn1 = 1.
3. For the next node n in the reverse schedule, find the child nodes ch(n). Then define
tn = Σ_{c∈ch(n)} (∂fc/∂fn) tc
4. The total derivatives of f with respect to the root nodes of the tree (here x1 and x2) are given by the values of t at those nodes.
This is a general procedure that can be used to automatically define a subroutine
to efficiently compute the gradient. It is efficient because information is collected
at nodes in the tree and split between parents only when required.
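A minimal hand-written sketch of this procedure for the f(x1, x2) = cos(sin(x1 x2)) example above; a real autodiff library builds the schedule and the t-messages automatically from the computational graph:

```python
import numpy as np

def grad_f(x1, x2):
    # forward pass: associate each node of the AST with its value
    f1 = x1 * x2
    f2 = np.sin(f1)
    f3 = np.cos(f2)

    # reverse pass over the schedule (f3, f2, f1, x1, x2)
    t3 = 1.0                      # t at the first node of the reverse schedule
    t2 = -np.sin(f2) * t3         # (df3/df2) t3
    t1 = np.cos(f1) * t2          # (df2/df1) t2
    tx1 = x2 * t1                 # (df1/dx1) t1  = df3/dx1
    tx2 = x1 * t1                 # (df1/dx2) t1  = df3/dx2
    return f3, (tx1, tx2)

print(grad_f(0.7, 1.3))
```

Note how the shared factor t1 is computed once and only then split between x1 and x2, which is exactly the saving described above.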
Limitations of forward reasoning
World Representation
Recognising patterns (perceptron style) is only one form of intelligence.
Solving chess problems is another and requires complex reasoning using some
form of internal model.
The world is noisy and information may be conflicting.
Recognised that new approaches are required.
Table of Contents
History of the AI dream
How do brains work?
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Limitations of forward reasoning
World Representation
Models help us to fantasise about the world.
Models
Models
Burglar Problem
Creaks and Bumps
Creak Bump
Burglar Model
Figure: positions pos1, . . . , pos4 form a chain, each emitting a sound snd1, . . . , snd4.
pos - position in kitchen
snd – sound
Finding the Burglar
creak! creak!
bump!
creak!
bump! bump!
creak!
bump! bump! bump!
creak!
bump!
Stubby Fingers
Stubby Fingers
Figure: intended keys int1, . . . , int4 form a chain, each emitting the actually hit key hit1, . . . , hit4.
int - intended key
hit – hit key
Stubby Fingers: errors
Figure: the confusion matrix p(hit | intended) over the letters a–z (colour scale from roughly 0.05 to 0.55).
Stubby Fingers: language
Figure: the language model — letter transition probabilities over a–z (colour scale from 0 to roughly 0.9).
Stubby Fingers
Given the typed sequence cwsykcak what is the most likely word that this
corresponds to?
List the 200 most likely hidden sequences
Discard those that are not in a standard English dictionary
Take the most likely proper English word as the intended typed word
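A toy sketch of this N-best-then-filter idea on a deliberately tiny alphabet and dictionary; the uniform error model and the absence of a letter-transition language model are my simplifications (the real demo uses the learned confusion and transition matrices shown above):

```python
import itertools
import numpy as np

alphabet = list('acdgoqst')             # tiny toy alphabet (the real demo uses a-z)
dictionary = {'cat', 'dog', 'sat'}      # toy stand-in for an English dictionary

def p_hit_given_intended(hit, intended):
    # toy error model: mostly the intended key, otherwise uniform over the rest
    return 0.6 if hit == intended else 0.4 / (len(alphabet) - 1)

def decode(typed, n_best=200):
    # score every hidden (intended) sequence; a uniform prior over sequences
    # stands in for the letter-transition language model here
    scored = []
    for seq in itertools.product(alphabet, repeat=len(typed)):
        p = np.prod([p_hit_given_intended(t, i) for t, i in zip(typed, seq)])
        scored.append((float(p), ''.join(seq)))
    scored.sort(reverse=True)
    # keep the n_best most likely sequences, discard non-words,
    # return the most likely proper word
    for p, word in scored[:n_best]:
        if word in dictionary:
            return word, p
    return None, 0.0

print(decode('cqt'))                    # -> ('cat', ...)
```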
Speech Recognition: raw signal
Figure: a raw speech waveform, amplitude plotted against time.
‘neural’ representation
Figure: the corresponding 'neural' (feature map) representation of the signal.
Speech Recognition
Figure: phonemes pho1, . . . , pho4 form a chain, each emitting an audio feature aud1, . . . , aud4.
pho: phoneme (letter)
aud: audio signal (neural representation)
Medical Diagnosis
Figure: a belief network linking diseases (tumour, flu, meningitis) to findings (headache, fever, appetite, x-ray).
Combine known medical knowledge with patient specific information.
Table of Contents
History of the AI dream
How do brains work?
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Probability
Why Probability?
Probability is a logical calculus of uncertainty.
Natural framework to use in models of physical systems, such as the Ising
Model (1920) and in AI applications, such as the HMM (Baum 1966,
Stratonovich 1960).
The need for structure
We often want to make a probabilistic description of many objects (electron
spins, neurons, customers, etc. ).
Typically the representational and computational cost of probabilistic models
grows exponentially with the number of objects represented.
Without introducing strong structural limitations about how these objects can
interact, probability is a non-starter.
For this reason, computationally 'simpler' alternatives (such as fuzzy logic)
were introduced to try to avoid some of these difficulties – however, these are
typically frowned upon by purists.
Graphical Models
We can use graphs to represent how objects can probabilistically interact with
each other.
Graphical Models are then a marriage between Graph and Probability theory.
Many of the quantities that we would like to compute in a probability
distribution can then be related to operations on the graph.
The computational complexity of operations can often be related to the
structure of the graph.
Graphical Models are now used as a standard framework in Engineering,
Statistics and Computer Science.
Graphical Models are used to perform reasoning under uncertainty and are
therefore widely applicable.
Uses in Industry
Microsoft: used to estimate the skill distribution of players in online games
(the world's largest graphical model?!).
Hospitals use Belief Nets to encode knowledge about diseases and symptoms
to aid medical diagnosis.
Google, Microsoft, Facebook: used in many places, including advertising,
video game prediction, speech recognition.
Used to estimate inherent desirability of products in consumer retail.
Microsoft and others: attempts to go beyond simple A/B testing by using
Graphical Models to model the whole company/user relationship.
Conditional Probability and Bayes’ Rule
The probability of event x conditioned on knowing event y (or more shortly, the
probability of x given y) is defined as
p(x|y) ≡ p(x, y)/p(y) = p(y|x)p(x)/p(y) (Bayes' rule)
Throwing darts
p(region 5|not region 20) = p(region 5, not region 20)/p(not region 20) = p(region 5)/p(not region 20) = (1/20)/(19/20) = 1/19
Interpretation
p(A = a|B = b) should not be interpreted as ‘Given the event B = b has occurred,
p(A = a|B = b) is the probability of the event A = a occurring’. The correct
interpretation should be ‘p(A = a|B = b) is the probability of A being in state a
under the constraint that B is in state b’.
Battleships
Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5
pixels each.
Can be placed anywhere on the 10×10 grid, but cannot overlap.
Let s1 be the origin of ship 1 and s2 the origin of ship 2.
Data D is a collection of query ‘hit’ or ‘miss’ responses.
p(s1, s2|D) = p(D|s1, s2)p(s1, s2)/p(D)
Let X be the matrix of pixel occupancy:
p(X|D) = Σ_{s1,s2} p(X, s1, s2|D) = Σ_{s1,s2} p(X|s1, s2)p(s1, s2|D)
demoBattleships.m
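The MATLAB demo demoBattleships.m referenced above performs this computation; here is a hedged Python re-sketch by brute-force enumeration (the grid size and ship length follow the slide, the query data and everything else are my own illustrative choices):

```python
import numpy as np

GRID, SHIP = 10, 5

def placements():
    # all non-overlapping placements of one vertical and one horizontal 5-pixel ship
    vert = [(r, c) for r in range(GRID - SHIP + 1) for c in range(GRID)]
    horz = [(r, c) for r in range(GRID) for c in range(GRID - SHIP + 1)]
    for (r1, c1) in vert:
        occ1 = {(r1 + k, c1) for k in range(SHIP)}
        for (r2, c2) in horz:
            occ2 = {(r2, c2 + k) for k in range(SHIP)}
            if not occ1 & occ2:
                yield occ1 | occ2

def posterior_occupancy(data):
    # data: list of ((row, col), hit?) query responses
    post, n_consistent = np.zeros((GRID, GRID)), 0
    for occ in placements():
        # p(D|s1,s2) is 1 if every hit lies on a ship and every miss does not, else 0
        if all((cell in occ) == hit for cell, hit in data):
            n_consistent += 1
            for (r, c) in occ:
                post[r, c] += 1.0
    return post / n_consistent          # p(pixel occupied | D) under a uniform p(s1, s2)

D = [((4, 4), True), ((0, 0), False)]   # one 'hit' and one 'miss' query (made up)
print(np.round(posterior_occupancy(D), 2))
```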
Table of Contents
History of the AI dream
How do brains work?
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Belief Networks (Bayesian Networks)
A belief network is a directed acyclic graph in which each node has associated the
conditional probability of the node given its parents.
The joint distribution is obtained by taking the product of the conditional
probabilities:
p(A, B, C, D, E) = p(A)p(B)p(C|A, B)p(D|C)p(E|B, C)
Figure: a belief network on A, B, C, D, E with edges A → C, B → C, C → D, B → E, C → E.
Example – Part I
Sally’s burglar Alarm is sounding. Has she been Burgled, or was the alarm
triggered by an Earthquake? She turns the car Radio on for news of earthquakes.
Choosing an ordering
Without loss of generality, we can write
p(A, R, E, B) = p(A|R, E, B)p(R, E, B)
= p(A|R, E, B)p(R|E, B)p(E, B)
= p(A|R, E, B)p(R|E, B)p(E|B)p(B)
Assumptions:
The alarm is not directly influenced by any report on the radio,
p(A|R, E, B) = p(A|E, B)
The radio broadcast is not directly influenced by the burglar variable,
p(R|E, B) = p(R|E)
Burglaries don’t directly ‘cause’ earthquakes, p(E|B) = p(E)
Therefore
p(A, R, E, B) = p(A|E, B)p(R|E)p(E)p(B)
Example – Part II: Specifying the Tables
Figure: the belief network B → A ← E → R.
p(A = 1|B, E):
  B = 1, E = 1 : 0.9999
  B = 1, E = 0 : 0.99
  B = 0, E = 1 : 0.99
  B = 0, E = 0 : 0.0001
p(R = 1|E):
  E = 1 : 1
  E = 0 : 0
The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables
and graphical structure fully specify the distribution.
Example Part III: Inference
Initial Evidence: The alarm is sounding
p(B = 1|A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)
= Σ_{E,R} p(A = 1|B = 1, E)p(B = 1)p(E)p(R|E) / Σ_{B,E,R} p(A = 1|B, E)p(B)p(E)p(R|E)
≈ 0.99
Additional Evidence: The radio broadcasts an earthquake warning:
A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01.
Initially, because the alarm sounds, Sally thinks that she’s been burgled.
However, this probability drops dramatically when she hears that there has
been an earthquake.
Markov Models
For timeseries data v1, . . . , vT , we need a model p(v1:T ). For causal consistency, it
is meaningful to consider the decomposition
p(v1:T ) = ∏_{t=1}^{T} p(vt|v1:t−1)
with the convention p(vt|v1:t−1) = p(v1) for t = 1.
Figure: the corresponding cascade belief network on v1, . . . , v4.
Independence assumptions
It is often natural to assume that the influence of the immediate past is more
relevant than the remote past and in Markov models only a limited number of
previous observations are required to predict the future.
Markov Chain
Only the recent past is relevant:
p(vt|v1, . . . , vt−1) = p(vt|vt−L, . . . , vt−1)
where L ≥ 1 is the order of the Markov chain
p(v1:T ) = p(v1)p(v2|v1)p(v3|v2) . . . p(vT |vT −1)
For a stationary Markov chain the transitions p(vt = s′|vt−1 = s) = f(s′, s) are
time-independent ('homogeneous').
Figure: (a) first order Markov chain on v1, . . . , v4; (b) second order Markov chain.
Markov Chains
Figure: a first order Markov chain on v1, . . . , v4.
p(v1, . . . , vT ) = p(v1) ∏_{t=2}^{T} p(vt|vt−1), with p(v1) the initial distribution and p(vt|vt−1) the transition.
State transition diagram
Nodes represent states of the variable v and arcs non-zero elements of the
transition p(vt|vt−1)
Figure: a state transition diagram on the states 1–9.
Most probable and shortest paths
Figure: the same 9-state transition diagram.
The shortest (unweighted) path from state 1 to state 7 is 1 − 2 − 7.
The most probable path from state 1 to state 7 is 1 − 8 − 9 − 7 (assuming
uniform transition probabilities). The latter path is longer but more probable
since for the path 1 − 2 − 7, the probability of exiting state 2 into state 7 is
1/5.
Equilibrium distribution
It is interesting to know how the marginal p(xt) evolves through time:
p(xt = i) = Σ_j p(xt = i|xt−1 = j) p(xt−1 = j), with transition matrix Mij ≡ p(xt = i|xt−1 = j)
p(xt = i) is the frequency that we visit state i at time t, given we started
from p(x1) and randomly drew samples from the transition p(xτ |xτ−1).
As we repeatedly sample a new state from the chain, the distribution at time
t, for an initial distribution p1(i) is
pt = M^{t−1} p1
If, for t → ∞, p∞ is independent of the initial distribution p1, then p∞ is
called the equilibrium distribution of the chain:
p∞ = Mp∞
The equilibrium distribution is proportional to the eigenvector of the transition
matrix with unit eigenvalue.
PageRank
Define the matrix
Aij = 1 if website j has a hyperlink to website i, and 0 otherwise.
From this we can define a Markov transition matrix with elements
Mij = Aij / Σ_{i′} Ai′j
If we jump from website to website, the equilibrium distribution component
p∞(i) is the relative number of times we will visit website i. This has a
natural interpretation as the ‘importance’ of website i.
For each website i a list of words associated with that website is collected.
After doing this for all websites, one can make an 'inverse' list of which
websites contain word w. When a user searches for word w, the list of
websites that contain that word is then returned, ranked according to the
importance of the site.
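A small sketch of this computation on a made-up four-site web, finding the equilibrium distribution both by repeatedly applying M and via the unit-eigenvalue eigenvector:

```python
import numpy as np

# A[i, j] = 1 if website j links to website i (a made-up 4-site web)
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)

M = A / A.sum(axis=0)            # column-stochastic transition matrix M_ij

p = np.ones(4) / 4               # any initial distribution p_1
for _ in range(100):             # p_t = M^(t-1) p_1
    p = M @ p
print(p)                         # equilibrium distribution: 'importance' of each site

# equivalently, the eigenvector of M with unit eigenvalue
w, v = np.linalg.eig(M)
p_eig = np.real(v[:, np.argmax(np.real(w))])
print(p_eig / p_eig.sum())
```

(The real PageRank also mixes in a small uniform 'teleport' probability so that the equilibrium distribution exists for any link structure; that detail is omitted here.)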
Hidden Markov Models
The HMM defines a Markov chain on hidden (or ‘latent’) variables h1:T . The
observed (or ‘visible’) variables are dependent on the hidden variables through an
emission p(vt|ht). This defines a joint distribution
p(h1:T , v1:T ) = p(v1|h1)p(h1) ∏_{t=2}^{T} p(vt|ht)p(ht|ht−1)
For a stationary HMM the transition p(ht|ht−1) and emission p(vt|ht) distributions
are constant through time.
Figure: a first order hidden Markov model with 'hidden' variables dom(ht) = {1, . . . , H}, t = 1 : T. The 'visible' variables vt can be either discrete or continuous.
The classical inference problems
Filtering (Inferring the present) p(ht|v1:t)
Prediction (Inferring the future) p(ht|v1:s) t > s
Smoothing (Inferring the past) p(ht|v1:u) t < u
Likelihood p(v1:T )
Most likely path (Viterbi alignment) argmax_{h1:T} p(h1:T |v1:T )
For prediction, one is also often interested in p(vt|v1:s) for t > s.
Inference in Hidden Markov Models
Belief network representation of a HMM:
Figure: the hidden chain h1 → h2 → h3 → h4 with emissions ht → vt.
Filtering, Smoothing and Viterbi are all computationally efficient, scaling
linearly with the length of the timeseries (but quadratically with the number
of hidden states).
The algorithms are variants of ‘message passing on factor graphs’
Algorithm guaranteed to work if the graph is singly-connected.
Huge research effort in the last 15 years to apply message passing for
approximate inference in multiply-connected graphs (eg low-density
parity-check codes).
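As an illustration of how cheap this is, here is a sketch of filtering (the forward recursion) for a made-up two-state, two-symbol HMM; the normalisation at each step gives p(ht|v1:t) directly:

```python
import numpy as np

def filtering(p_trans, p_emit, p_h1, visibles):
    # p_trans[i, j] = p(h_t = i | h_{t-1} = j);  p_emit[v, i] = p(v_t = v | h_t = i)
    alpha = p_emit[visibles[0]] * p_h1
    alpha /= alpha.sum()                       # p(h_1 | v_1)
    filtered = [alpha]
    for v in visibles[1:]:
        alpha = p_emit[v] * (p_trans @ alpha)  # proportional to p(h_t | v_{1:t})
        alpha /= alpha.sum()                   # normalise
        filtered.append(alpha)
    return filtered

# made-up 2-state, 2-symbol HMM
p_trans = np.array([[0.9, 0.2],
                    [0.1, 0.8]])
p_emit = np.array([[0.8, 0.3],                 # p(v=0 | h)
                   [0.2, 0.7]])                # p(v=1 | h)
p_h1 = np.array([0.5, 0.5])

for t, a in enumerate(filtering(p_trans, p_emit, p_h1, [0, 0, 1, 1]), 1):
    print(f'p(h_{t} | v_1:{t}) =', np.round(a, 3))
```

Each step costs one H × H matrix-vector product, giving the linear-in-T, quadratic-in-H scaling stated above.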
HMMs for speech recognition
ht is the phoneme at time t. p(ht|ht−1) – language model. p(vt|ht) – speech
signal model.
Deep Nets and HMMs
Figure: the same HMM structure, with hidden phonemes h1:4 emitting observations v1:4.
Recently companies including Google have made big advances in speech
recognition.
The breakthrough is to model p(vt|ht) as a Gaussian whose mean is some
function of the phoneme µ(ht; θ).
This function is a deep neural network, trained on a large amount of data.
Goldrush at the moment to find similar breakthrough applications of deep
networks in reasoning systems.
Table of Contents
History of the AI dream
How do brains work?
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Generative Model
Figure: a generative model in which a small number of latent variables h1, h2 are parents of the visible variables v1, . . . , v4.
It is natural to consider that objects (images for example) can be constructed
on the basis of a low dimensional representation.
Note that this is a Graphical Model, not a Function
The latent variables h can be sampled from, using p(h) and then an image
sampled from p(v|h).
One cannot use an autoencoder to generate new images.
The bad news
Inference (computing p(h|v)) and parameter learning are intractable in these
models.
Statisticians typically use sampling as an approximation.
Very popular in ML to use a variational method – much faster for inference.
Variational Inference
Consider a distribution
p(v|θ) = Σ_h p(v|h, θ)p(h)
and that we wish to learn θ to maximise the probability that this model generates
the observed data.
log p(v|θ) ≥ − Σ_h q(h|v, φ) log q(h|v, φ) + Σ_h q(h|v, φ) log p(v|h, θ) + const.
Idea is to choose a ‘variational’ distribution q(h|v, φ) such that we can either
calculate analytically the bound, or sample it efficiently.
We then jointly maximise the bound wrt φ and θ.
We can parameterise p(v|h, θ) using a deep network.
Very popular approach – see ‘variational autoencoder’ and also attention
mechanisms.
Extension to a semi-supervised method using p(v) = Σ_{h,c} p(v|h, c)p(c)p(h)
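For a small discrete latent variable the bound can be computed exactly, which makes its two key properties easy to check numerically: it is never above log p(v|θ), and it is tight when q(h|v) is the exact posterior. A sketch (the distributions are random; the p(h) term is written out rather than absorbed into the constant):

```python
import numpy as np

rng = np.random.default_rng(0)
H, V = 3, 4                                        # latent and visible cardinalities

p_h = np.ones(H) / H                               # p(h), fixed
theta = rng.dirichlet(np.ones(V), size=H)          # p(v | h, theta): one row per h

def log_p(v):
    return np.log(np.sum(theta[:, v] * p_h))       # exact log p(v | theta)

def bound(v, q):
    # -sum_h q log q + sum_h q [log p(v|h,theta) + log p(h)]
    return np.sum(q * (np.log(theta[:, v]) + np.log(p_h) - np.log(q)))

v = 2
q_crude = np.ones(H) / H                           # a crude variational q(h|v, phi)
q_exact = theta[:, v] * p_h / np.sum(theta[:, v] * p_h)   # the true posterior p(h|v)

print(bound(v, q_crude), '<=', log_p(v))           # strictly below log p(v)
print(bound(v, q_exact), '==', log_p(v))           # tight at q = p(h|v)
```

Variational learning maximises this bound jointly over φ (the parameters of q) and θ, with p(v|h, θ) typically parameterised by a deep network as above.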
DRAW
Table of Contents
History of the AI dream
How do brains work?
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Reinforcement Learning
Can we teach computers to play Atari video games?
Deep Reinforcement Learning
Given a state of the world W and a set of possible actions A, we need to
decide which action to take in any state of W that will be best for our long
term goals.
Problem is that the number of pixel states is enormous.
Need to learn a low dimensional representation of the screen (use a deep
generative model).
Learn then which action to take given the low dimensional representation.
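The Atari work combines this with deep networks over learned screen representations; the core 'which action is best in the long run' idea can be sketched with plain tabular Q-learning on a made-up chain world (states, rewards and hyperparameters here are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2             # toy chain world; action 0 = left, 1 = right
gamma, alpha, eps = 0.9, 0.1, 0.1      # discount, learning rate, exploration rate

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == n_states - 1 else 0.0      # reward only at the right-hand end
    return s2, r

Q = np.zeros((n_states, n_actions))    # estimated long-term value of action a in state s
for episode in range(500):
    s = 0
    for t in range(20):
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r = step(s, a)
        # move Q(s,a) towards the reward plus the discounted best future value
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

print(np.argmax(Q, axis=1))            # learned policy: move right in every state
```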
Tetris
Google
Table of Contents
History of the AI dream
How do brains work?
Connectionism
AutoDiff
Fantasy Machines
Probability
Directed Graphical Models
Variational Generative Models
Reinforcement Learning
Outlook
Outlook
Machine Learning is in a boom period.
Renewed interest and hope in creating AI.
Combine new computational power with suitable hierarchical representations.
Impressive state of the art results in Speech Recognition, Image Analysis,
Game Playing.
Challenges
Improve understanding of optimisation for deep learning.
Learn how to more efficiently exploit computational resources.
Learn how to exploit massive databases.
Improve interaction between reinforcement learning and representation
learning.
Marry non-symbolic (neural) with symbolic (Bayesian reasoning)
Emphasis is on scalability.
Feel free to contact me at UCL or at my AI company reinfer
https://reinfer.io
Contenu connexe

Tendances

Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 
[PR12] Spectral Normalization for Generative Adversarial Networks
[PR12] Spectral Normalization for Generative Adversarial Networks[PR12] Spectral Normalization for Generative Adversarial Networks
[PR12] Spectral Normalization for Generative Adversarial NetworksJaeJun Yoo
 
Recommendation system using collaborative deep learning
Recommendation system using collaborative deep learningRecommendation system using collaborative deep learning
Recommendation system using collaborative deep learningRitesh Sawant
 
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)Universitat Politècnica de Catalunya
 
Machine Learning: Generative and Discriminative Models
Machine Learning: Generative and Discriminative ModelsMachine Learning: Generative and Discriminative Models
Machine Learning: Generative and Discriminative Modelsbutest
 
Unsupervised Learning (DLAI D9L1 2017 UPC Deep Learning for Artificial Intell...
Unsupervised Learning (DLAI D9L1 2017 UPC Deep Learning for Artificial Intell...Unsupervised Learning (DLAI D9L1 2017 UPC Deep Learning for Artificial Intell...
Unsupervised Learning (DLAI D9L1 2017 UPC Deep Learning for Artificial Intell...Universitat Politècnica de Catalunya
 
Deep Learning: concepts and use cases (October 2018)
Deep Learning: concepts and use cases (October 2018)Deep Learning: concepts and use cases (October 2018)
Deep Learning: concepts and use cases (October 2018)Julien SIMON
 
Introduction to Generative Adversarial Networks (GANs)
Introduction to Generative Adversarial Networks (GANs)Introduction to Generative Adversarial Networks (GANs)
Introduction to Generative Adversarial Networks (GANs)Appsilon Data Science
 
PixelCNN, Wavenet, Normalizing Flows - Santiago Pascual - UPC Barcelona 2018
PixelCNN, Wavenet, Normalizing Flows - Santiago Pascual - UPC Barcelona 2018PixelCNN, Wavenet, Normalizing Flows - Santiago Pascual - UPC Barcelona 2018
PixelCNN, Wavenet, Normalizing Flows - Santiago Pascual - UPC Barcelona 2018Universitat Politècnica de Catalunya
 
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019Universitat Politècnica de Catalunya
 
Model-Based Reinforcement Learning @NIPS2017
Model-Based Reinforcement Learning @NIPS2017Model-Based Reinforcement Learning @NIPS2017
Model-Based Reinforcement Learning @NIPS2017mooopan
 
Deep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ersDeep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ersRoelof Pieters
 
Deep Learning Class #0 - You Can Do It
Deep Learning Class #0 - You Can Do ItDeep Learning Class #0 - You Can Do It
Deep Learning Class #0 - You Can Do ItHolberton School
 
【DL輪読会】Physion: Evaluating Physical Prediction from Vision in Humans and Mach...
【DL輪読会】Physion: Evaluating Physical Prediction from Vision in Humans and Mach...【DL輪読会】Physion: Evaluating Physical Prediction from Vision in Humans and Mach...
【DL輪読会】Physion: Evaluating Physical Prediction from Vision in Humans and Mach...Deep Learning JP
 
Deep Learning - A Literature survey
Deep Learning - A Literature surveyDeep Learning - A Literature survey
Deep Learning - A Literature surveyAkshay Hegde
 
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019Universitat Politècnica de Catalunya
 

Tendances (20)

Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
[PR12] Spectral Normalization for Generative Adversarial Networks
[PR12] Spectral Normalization for Generative Adversarial Networks[PR12] Spectral Normalization for Generative Adversarial Networks
[PR12] Spectral Normalization for Generative Adversarial Networks
 
Recommendation system using collaborative deep learning
Recommendation system using collaborative deep learningRecommendation system using collaborative deep learning
Recommendation system using collaborative deep learning
 
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
 
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
 
Machine Learning: Generative and Discriminative Models
Machine Learning: Generative and Discriminative ModelsMachine Learning: Generative and Discriminative Models
Machine Learning: Generative and Discriminative Models
 
MILA DL & RL summer school highlights
MILA DL & RL summer school highlights MILA DL & RL summer school highlights
MILA DL & RL summer school highlights
 
Unsupervised Learning (DLAI D9L1 2017 UPC Deep Learning for Artificial Intell...
Unsupervised Learning (DLAI D9L1 2017 UPC Deep Learning for Artificial Intell...Unsupervised Learning (DLAI D9L1 2017 UPC Deep Learning for Artificial Intell...
Unsupervised Learning (DLAI D9L1 2017 UPC Deep Learning for Artificial Intell...
 
Deep Learning: concepts and use cases (October 2018)
Deep Learning: concepts and use cases (October 2018)Deep Learning: concepts and use cases (October 2018)
Deep Learning: concepts and use cases (October 2018)
 
Introduction to Generative Adversarial Networks (GANs)
Introduction to Generative Adversarial Networks (GANs)Introduction to Generative Adversarial Networks (GANs)
Introduction to Generative Adversarial Networks (GANs)
 
PixelCNN, Wavenet, Normalizing Flows - Santiago Pascual - UPC Barcelona 2018
PixelCNN, Wavenet, Normalizing Flows - Santiago Pascual - UPC Barcelona 2018PixelCNN, Wavenet, Normalizing Flows - Santiago Pascual - UPC Barcelona 2018
PixelCNN, Wavenet, Normalizing Flows - Santiago Pascual - UPC Barcelona 2018
 
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
 
Neural Architectures for Video Encoding
Neural Architectures for Video EncodingNeural Architectures for Video Encoding
Neural Architectures for Video Encoding
 
Model-Based Reinforcement Learning @NIPS2017
Model-Based Reinforcement Learning @NIPS2017Model-Based Reinforcement Learning @NIPS2017
Model-Based Reinforcement Learning @NIPS2017
 
Deep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ersDeep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ers
 
The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021
 
Deep Learning Class #0 - You Can Do It
Deep Learning Class #0 - You Can Do ItDeep Learning Class #0 - You Can Do It
Deep Learning Class #0 - You Can Do It
 
【DL輪読会】Physion: Evaluating Physical Prediction from Vision in Humans and Mach...
【DL輪読会】Physion: Evaluating Physical Prediction from Vision in Humans and Mach...【DL輪読会】Physion: Evaluating Physical Prediction from Vision in Humans and Mach...
【DL輪読会】Physion: Evaluating Physical Prediction from Vision in Humans and Mach...
 
Deep Learning - A Literature survey
Deep Learning - A Literature surveyDeep Learning - A Literature survey
Deep Learning - A Literature survey
 
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
 

Similaire à David Barber - Deep Nets, Bayes and the story of AI

Artificial Intelligence is back, Deep Learning Networks and Quantum possibili...
Artificial Intelligence is back, Deep Learning Networks and Quantum possibili...Artificial Intelligence is back, Deep Learning Networks and Quantum possibili...
Artificial Intelligence is back, Deep Learning Networks and Quantum possibili...John Mathon
 
Scene Description From Images To Sentences
Scene Description From Images To SentencesScene Description From Images To Sentences
Scene Description From Images To SentencesIRJET Journal
 
Chatbots in 2017 -- Ithaca Talk Dec 6
Chatbots in 2017 -- Ithaca Talk Dec 6Chatbots in 2017 -- Ithaca Talk Dec 6
Chatbots in 2017 -- Ithaca Talk Dec 6Paul Houle
 
7-200404101602.pdf
7-200404101602.pdf7-200404101602.pdf
7-200404101602.pdfssuser07e9f2
 
Deep Learning from Scratch - Building with Python from First Principles.pdf
Deep Learning from Scratch - Building with Python from First Principles.pdfDeep Learning from Scratch - Building with Python from First Principles.pdf
Deep Learning from Scratch - Building with Python from First Principles.pdfYungSang1
 
What Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceWhat Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceJonathan Mugan
 
Big Data Intelligence: from Correlation Discovery to Causal Reasoning
Big Data Intelligence: from Correlation Discovery to Causal Reasoning Big Data Intelligence: from Correlation Discovery to Causal Reasoning
Big Data Intelligence: from Correlation Discovery to Causal Reasoning Wanjin Yu
 
Deep learning - A Visual Introduction
Deep learning - A Visual IntroductionDeep learning - A Visual Introduction
Deep learning - A Visual IntroductionLukas Masuch
 
Multimediaexercise
MultimediaexerciseMultimediaexercise
MultimediaexerciseRony Mohamed
 
Deep Learning and Watson Studio
Deep Learning and Watson StudioDeep Learning and Watson Studio
Deep Learning and Watson StudioSasha Lazarevic
 
SNLI_presentation_2
SNLI_presentation_2SNLI_presentation_2
SNLI_presentation_2Viral Gupta
 
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018Universitat Politècnica de Catalunya
 
Objects for modeling world
Objects for modeling worldObjects for modeling world
Objects for modeling worldEswaran P
 

Similaire à David Barber - Deep Nets, Bayes and the story of AI (20)

Artificial Intelligence is back, Deep Learning Networks and Quantum possibili...
Artificial Intelligence is back, Deep Learning Networks and Quantum possibili...Artificial Intelligence is back, Deep Learning Networks and Quantum possibili...
Artificial Intelligence is back, Deep Learning Networks and Quantum possibili...
 
Semeval Deep Learning In Semantic Similarity
Semeval Deep Learning In Semantic SimilaritySemeval Deep Learning In Semantic Similarity
Semeval Deep Learning In Semantic Similarity
 
Scene Description From Images To Sentences
Scene Description From Images To SentencesScene Description From Images To Sentences
Scene Description From Images To Sentences
 
Image captioning
Image captioningImage captioning
Image captioning
 
Chatbots in 2017 -- Ithaca Talk Dec 6
Chatbots in 2017 -- Ithaca Talk Dec 6Chatbots in 2017 -- Ithaca Talk Dec 6
Chatbots in 2017 -- Ithaca Talk Dec 6
 
Generative models
Generative modelsGenerative models
Generative models
 
7-200404101602.pdf
7-200404101602.pdf7-200404101602.pdf
7-200404101602.pdf
 
Deep Learning from Scratch - Building with Python from First Principles.pdf
Deep Learning from Scratch - Building with Python from First Principles.pdfDeep Learning from Scratch - Building with Python from First Principles.pdf
Deep Learning from Scratch - Building with Python from First Principles.pdf
 
What Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceWhat Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial Intelligence
 
Big Data Intelligence: from Correlation Discovery to Causal Reasoning
Big Data Intelligence: from Correlation Discovery to Causal Reasoning Big Data Intelligence: from Correlation Discovery to Causal Reasoning
Big Data Intelligence: from Correlation Discovery to Causal Reasoning
 
Query Understanding
Query UnderstandingQuery Understanding
Query Understanding
 
Deep learning - A Visual Introduction
Deep learning - A Visual IntroductionDeep learning - A Visual Introduction
Deep learning - A Visual Introduction
 
Multimediaexercise
MultimediaexerciseMultimediaexercise
Multimediaexercise
 
Deep Learning and Watson Studio
Deep Learning and Watson StudioDeep Learning and Watson Studio
Deep Learning and Watson Studio
 
Artificial intelligence
Artificial intelligenceArtificial intelligence
Artificial intelligence
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Deep learning
Deep learning Deep learning
Deep learning
 
SNLI_presentation_2
SNLI_presentation_2SNLI_presentation_2
SNLI_presentation_2
 
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
 
Objects for modeling world
Objects for modeling worldObjects for modeling world
Objects for modeling world
 

Dernier

Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physicsvishikhakeshava1
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxpradhanghanshyam7136
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Caco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionCaco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionPriyansha Singh
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Nistarini College, Purulia (W.B) India
 

Dernier (20)

Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physics
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptx
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Caco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionCaco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorption
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...
 

David Barber - Deep Nets, Bayes and the story of AI

  • 1. Deep Nets, Bayes and the story of AI (continued) David Barber
  • 2. Table of Contents History of the AI dream How do brains work? Connectionism AutoDiff Fantasy Machines Probability Directed Graphical Models Variational Generative Models Reinforcement Learning Outlook
  • 3. Intelligent Machinery 1948 Turing and Champernowne ‘paper and pencil’ chess.
  • 4. Intelligent Machinery 1951 Prinz mate-in-two moves chess machine. 1952 Strachey programs first computer draughts algorithm.
  • 5. Learning Machines 1951 Oettinger makes first program that ‘learns’. 1955 Samuel adds ‘learning’ to his draughts algorithm.
  • 6. Logical Intelligence 1968 Risch’s algorithm for integration in calculus 1972 Prolog for general logical reasoning 1997 Deep Blue defeats Kasparov
  • 7. Other forms of intelligence But is this getting us to where we’d like to be? Selfridge-Shannon film clip Speech Recognition Visual Processing Natural Language modelling Planning and decision in uncertain environments Perhaps a different approach would be useful.
  • 8. Table of Contents History of the AI dream How do brains work? Connectionism AutoDiff Fantasy Machines Probability Directed Graphical Models Variational Generative Models Reinforcement Learning Outlook
  • 9. Astonishing Hypothesis: Crick “A person’s mental activities are entirely due to the behaviour of nerve cells and the molecules that make them up and influence them.”
  • 12. Information Processing in Brains Neurons RealWorld Layer 1 Layer 2 High−level Concepts Feature Hierarchical; Modular; Binary; Parallel; Noisy
  • 13. Table of Contents History of the AI dream How do brains work? Connectionism AutoDiff Fantasy Machines Probability Directed Graphical Models Variational Generative Models Reinforcement Learning Outlook
  • 14. Artificial Neuron (Perceptron) weight 7 output neuron neuron 1 neuron 2 neuron 3 neuron 4 neuron 7 neuron 6 neuron 5 inputs weight 1
  • 15. Training an artificial neural network Want to generalise to new images with high accuracy.
  • 16. Artificial Network 1957 Rosenblatt’s perceptron perceptron film clip
  • 17. Connectionism 1960 Realised a perceptron can only solve simple tasks. 1970 Decline in interest. 1980 New computing power made training multilayer networks feasible. outputinputs Each node (or ‘neuron’) computes a function of a weighted combination of parental nodes: hj = σ( i wijhi)
  • 18. Neural Networks and Deep Learning Historical Problems with Neural Nets (1990s) NNs are difficult to train (many local optima?). Particularly difficult to train a NN with a large number of layers (say larger than around 10). ‘Gradient Diffusion Problem’ – difficult to assign responsibility of errors to individual ‘neurons’. Machine Learning (up to 2006) A large section of the machine learning community abandoned NNs. More principled and computationally better understood techniques (SVMs and related convex methods) replaced them. Bayesian AI (1990s onwards) From mid 1990s there was a realisation that pattern recognition is not sufficient for all AI purposes. Uncertainty and reasoning are not naturally representable using standard feed-forward nets. Explosion in more ‘symbolic’ Bayesian AI.
  • 19. Deep Learning NNs have resurged in interest in the last few years (Hinton, Bengio, . . . ) Also called ‘deep learning’. Sense that very complex tasks (object recognition, learning complex structure in data) requires going beyond simple (convex) statistical techniques. The brain uses hierarchical distributed processing and it is likely to be for a good reason. Many problems have a hierarchical structure: images are made of parts; language is hierarchical, etc. Why now? New computing resources (GPU processing) Availability of large amount of data means that we can train nets with many parameters (1010 ). Recent evidence suggests local optima are not particularly problematic.
  • 20. Autoencoder y1 y2 y3 y4 y5 h1 h2 h3 h4 h5 y1 y2 y3 y4 y5 h6 h7 h8 The bottleneck forces the network to try to find a low dimensional representation of the data. Useful for unsupervised learning.
  • 21. Autoencoder on MNIST digits (Hinton 2006 Science) Figure : Reconstructions using H = 30 components. From the Top: Original image, Autoencoder1, Autoencoder2, PCA 60,000 training images (28 × 28 = 784 pixels). Use a form of autoencoder to find a lower (30) dimensional representation. At the time, the special layerwise training procedure was considered fundamental to the success of this approach. Now not deemed necessary, provided we use a sensible initialisation.
  • 22. Google Cats 10 Million Youtube video frames (200x200 pixel images). Use a specialised autoencoder with 9 layers (1 billion weights). 2000 computers + two weeks of computing. Examine units to see what images they most respond to.
  • 24. Convolutional NNs CNNs are particularly popular in image processing Often the feature maps correspond (not to macro features such as bicycles) but micro features. For example, in handwritten digit recognition they correspond to small constituent parts of the digits. These are used then to process the image into a representation that is better for recognition.
  • 25. NNs in NLP Bag of Words We have D words in a dictionary, aardvark,. . .,zorro, so that we can relate each word with its dictionary index. We can also think of this as a Euclidean embedding e: aardvark → e_aardvark = (1, 0, . . . , 0)ᵀ, zorro → e_zorro = (0, 0, . . . , 1)ᵀ Word Embeddings Idea is to replace the Euclidean embeddings e with embeddings (vectors) v that are learned. Objective is, for example, next word prediction accuracy. These are often called ‘neural language models’.
  • 26. NNs in NLP Each word w in the dictionary has an associated embedding vector v_w. Usually around 200 dimensional vectors are used. Consider the sentence: the cat sat on the mat and that we wish to predict the word on given the two preceding words cat sat and the two succeeding words the mat We can use a network that has inputs v_cat, v_sat, v_the, v_mat The output of the network is a probability over all words in the dictionary p(w | {v_inputs}). We want p(w = on|v_cat, v_sat, v_the, v_mat) to be high. The overall objective is then to learn all the word embeddings and network parameters subject to predicting the word correctly based on the context.
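A toy sketch of such a neural language model, assuming a tiny vocabulary, context embeddings that are simply averaged, and a linear softmax output (a simplification of the network described above):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ['the', 'cat', 'sat', 'on', 'mat']
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                       # vocabulary size, embedding dimension

E = rng.normal(scale=0.1, size=(V, D))     # learned word embeddings v_w
U = rng.normal(scale=0.1, size=(D, V))     # learned output (softmax) weights

context, target = ['cat', 'sat', 'the', 'mat'], idx['on']

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

for step in range(500):
    h = np.mean([E[idx[w]] for w in context], axis=0)   # combine context embeddings
    p = softmax(h @ U)                                   # p(w | context)
    grad = p.copy()
    grad[target] -= 1.0                                  # gradient of -log p(target)
    gh = (U @ grad) / len(context)
    U -= 0.5 * np.outer(h, grad)
    for w in context:
        E[idx[w]] -= 0.5 * gh                            # update each context embedding

h = np.mean([E[idx[w]] for w in context], axis=0)
print(softmax(h @ U)[target])     # p(on | cat, sat, the, mat), now close to 1
```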
  • 27. Word Embeddings Given a word (France, for example) we can find which words w have embedding vectors closest to v_France. From Ronan Collobert (2011).
  • 28. Word Embeddings There appears to be a natural ‘geometry’ to the embeddings. For example, there are directions that correspond to gender. v_woman − v_man ≈ v_aunt − v_uncle v_woman − v_man ≈ v_queen − v_king From Mikolov (2013).
  • 29. Word Embeddings: Analogies Given a relationship, France–Paris, we get the ‘relationship’ embedding v = v_Paris − v_France Given Italy we can calculate v_Italy + v and find the word in the dictionary which has closest embedding to this (it turns out to be Rome!). From Mikolov (2013).
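A sketch of the analogy computation, assuming embeddings are compared by cosine similarity; the embedding vectors below are random placeholders, so only the arithmetic is meaningful:

```python
import numpy as np

def closest_word(query, embeddings, exclude=()):
    # return the dictionary word whose embedding has highest cosine
    # similarity to the query vector
    best, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in exclude:
            continue
        sim = vec @ query / (np.linalg.norm(vec) * np.linalg.norm(query) + 1e-12)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# embeddings would come from a trained neural language model; here they
# are random placeholders just to show the vector arithmetic
emb = {w: np.random.randn(200) for w in ['France', 'Paris', 'Italy', 'Rome', 'dog']}

relation = emb['Paris'] - emb['France']       # 'capital-of' direction
answer = closest_word(emb['Italy'] + relation, emb, exclude=('Italy', 'Paris'))
print(answer)   # with real trained embeddings this tends to be 'Rome'
```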
  • 30. Word Embeddings: Constrained Embeddings We can learn embeddings for English words and embeddings for Chinese words. However, when we know that a Chinese and English word have a similar meaning, we add a constraint that the word embeddings v_ChineseWord and v_EnglishWord should be close. We have only a small amount of labelled ‘similar’ Chinese-English words (these are the green border boxes in the above; they are standard translations of the corresponding Chinese character). We can visualise in 2D (using t-SNE) the embedding vectors. See Socher (2013)
  • 32. Recursive Nets and Embeddings Stanford Sentiment Treebank. Consists of parsed sentences with sentiment labels (−−, −, 0, +, ++) for each node (phrase) in the tree. 215,000 labelled phrases (obtained from three humans).
  • 33. Recursive Nets and Embeddings Idea is to recursively combine embeddings such that they accurately predict the sentiment at each node.
  • 34. Recursive Nets and Embeddings Training We have a softmax classifier for each node in the tree, to predict the sentiment of the phrase beneath this node in the tree. The weights of this classifier are shared across all nodes. At the leaf nodes at the bottom of the tree, the inputs to the classifiers are the word embeddings. The embeddings are combined by another network g with common parameters, which forms the input to the sentiment classifier. We then learn all the embeddings, shared classifier parameters and shared combination parameters to maximise the classification accuracy. Prediction For a new movie review, the review is first parsed using a standard grammar tree parser. This forms the tree which can be used to recursively form the sentiment class label for the review. Currently the best sentiment classifier. Socher (2013)
  • 35. Recursive Nets and Embeddings [Figure: RNTN predictions of positive and negative sentences and their negations, e.g. ‘Roger Dodger is one of the most compelling variations on this theme’ versus ‘Roger Dodger is one of the least compelling variations on this theme’. From Socher (2013).]
  • 36. Recurrent Nets [Figure: RNN unrolled through time, with inputs x1–x3, hidden units h1–h3 and outputs y1–y3, and weights A, B, C shared across time steps.] RNNs are used in timeseries applications. The basic idea is that the hidden units at time t (and possibly output y_t) depend on the previous state of the network h_{t−1}, x_{t−1}, y_{t−1} for inputs x_t and outputs y_t. In the above network, I ‘unrolled the net through time’ to give a standard NN diagram. I omitted the potential links from x_{t−1}, y_{t−1} to h_t.
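A minimal sketch of the unrolled recurrence, assuming h_t = tanh(W_xh x_t + W_hh h_{t−1}) and a linear output; the dimensions and weights are arbitrary:

```python
import numpy as np

def rnn_forward(xs, Wxh, Whh, Why, h0):
    # unroll the recurrence h_t = tanh(Wxh x_t + Whh h_{t-1}),
    # y_t = Why h_t, through the whole input sequence
    h, hs, ys = h0, [], []
    for x in xs:
        h = np.tanh(Wxh @ x + Whh @ h)
        hs.append(h)
        ys.append(Why @ h)
    return hs, ys

rng = np.random.default_rng(0)
D, H, O, T = 4, 8, 3, 5          # input dim, hidden dim, output dim, sequence length
xs = [rng.random(D) for _ in range(T)]
Wxh = rng.normal(scale=0.1, size=(H, D))
Whh = rng.normal(scale=0.1, size=(H, H))
Why = rng.normal(scale=0.1, size=(O, H))
hs, ys = rnn_forward(xs, Wxh, Whh, Why, h0=np.zeros(H))
print(len(ys), ys[0].shape)
```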
  • 37. Handwriting Generation using a RNN Some training examples.
  • 38. Handwriting Generation using a RNN Some generated examples. Top line is real handwriting, for comparison. See Alex Graves’ work.
  • 39. Table of Contents History of the AI dream How do brains work? Connectionism AutoDiff Fantasy Machines Probability Directed Graphical Models Variational Generative Models Reinforcement Learning Outlook
  • 40. Reasons research in deep learning has exploded Much greater compute power (GPU). Much larger datasets. AutoDiff. What is AutoDiff? AutoDiff takes a function f(x) and returns an exact value (up to machine accuracy) for the gradient g_i(x) ≡ ∂f(x)/∂x_i Note that this is not the same as a numerical approximation (such as central differences) for the gradient. One can show that, if done efficiently, one can always calculate the gradient in less than 5 times the time it takes to compute f(x).
  • 41. Reverse Differentiation A useful graphical representation is that the total derivative of f with respect to x is given by the sum over all path values from x to f, where each path value is the product of the partial derivatives of the functions on the edges: df/dx = ∂f/∂x + (∂f/∂g)(dg/dx) [Figure: graph with nodes x, g, f and edge labels ∂f/∂x (x→f), dg/dx (x→g), ∂f/∂g (g→f).] Example For f(x) = x² + xgh, where g = x² and h = xg² [Figure: graph with nodes x, g, h, f and edge labels 2x + gh (x→f), 2x (x→g), g² (x→h), xh (g→f), 2gx (g→h), xg (h→f).] f′(x) = (2x + gh) + (g²·xg) + (2x·2gx·xg) + (2x·xh) = 2x + 8x⁷
  • 42. Reverse Differentiation Consider f(x1, x2) = cos(sin(x1x2)) We can represent this computationally using an Abstract Syntax Tree (AST): [Figure: tree with leaves x1, x2 feeding f1, then f2, then f3.] f1(x1, x2) = x1x2, f2(x) = sin(x), f3(x) = cos(x) Given values for x1, x2, we first run forwards through the tree so that we can associate each node with an actual function value.
  • 44. Reverse Differentiation [Figure: the AST annotated with the local derivatives] ∂f1/∂x1 = x2, ∂f1/∂x2 = x1, ∂f2/∂f1 = cos(f1), ∂f3/∂f2 = − sin(f2) 1. Find the reverse ancestral (backwards) schedule of nodes (f3, f2, f1, x1, x2). 2. Start with the first node n1 in the reverse schedule and define t_{n1} = 1. 3. For the next node n in the reverse schedule, find the child nodes ch(n). Then define t_n = Σ_{c∈ch(n)} (∂f_c/∂f_n) t_c 4. The total derivatives of f with respect to the root nodes of the tree (here x1 and x2) are given by the values of t at those nodes. This is a general procedure that can be used to automatically define a subroutine to efficiently compute the gradient. It is efficient because information is collected at nodes in the tree and split between parents only when required.
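A hand-coded sketch of this reverse pass for the example f(x1, x2) = cos(sin(x1x2)), checked against central differences (a real AutoDiff tool builds the same schedule automatically from the AST):

```python
import numpy as np

def f_and_grad(x1, x2):
    # forward pass through the abstract syntax tree, storing node values
    f1 = x1 * x2
    f2 = np.sin(f1)
    f3 = np.cos(f2)
    # reverse pass: t_n accumulates d f3 / d(node n), children before parents
    t_f3 = 1.0
    t_f2 = -np.sin(f2) * t_f3        # d f3 / d f2
    t_f1 = np.cos(f1) * t_f2         # d f2 / d f1
    t_x1 = x2 * t_f1                 # d f1 / d x1
    t_x2 = x1 * t_f1                 # d f1 / d x2
    return f3, (t_x1, t_x2)

val, grad = f_and_grad(0.3, -1.2)
print(val, grad)
# check against a central-difference approximation
eps = 1e-6
num = (f_and_grad(0.3 + eps, -1.2)[0] - f_and_grad(0.3 - eps, -1.2)[0]) / (2 * eps)
print(num)   # should closely match grad[0]
```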
  • 45. Limitations of forward reasoning World Representation Recognising patterns (perceptron style) is only one form of intelligence. Solving chess problems is another and requires complex reasoning using some form of internal model. The world is noisy and information may be conflicting. Recognised that new approaches are required.
  • 46. Table of Contents History of the AI dream How do brains work? Connectionism AutoDiff Fantasy Machines Probability Directed Graphical Models Variational Generative Models Reinforcement Learning Outlook
  • 47. Limitations of forward reasoning World Representation Models help us to fantasise about the world.
  • 52. Burglar Model pos1 pos2 pos3 pos4 snd1 snd2 snd3 snd4 pos - position in kitchen snd – sound
  • 53. Finding the Burglar creak! creak! bump! creak! bump! bump! creak! bump! bump! bump! creak! bump!
  • 57. Stubby Fingers int1 int2 int3 int4 hit1 hit2 hit3 hit4 int - intended key hit – hit key
  • 58. Stubby Fingers: errors [Figure: 26 × 26 heat map of the key error distribution p(hit key | intended key) over the letters a–z.]
  • 59. Stubby Fingers: language [Figure: 26 × 26 heat map of the letter transition distribution p(next letter | current letter) over the letters a–z.]
  • 60. Stubby Fingers Given the typed sequence cwsykcak what is the most likely word that this corresponds to? List the 200 most likely hidden sequences Discard those that are not in a standard English dictionary Take the most likely proper English word as the intended typed word
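A simplified sketch of this idea: rather than enumerating the 200 most likely hidden sequences, it scores each dictionary candidate directly under an intended-key transition model and a key error model. The matrices and the word list below are random placeholders, not the models shown on the slides:

```python
import numpy as np

letters = 'abcdefghijklmnopqrstuvwxyz'
L = {c: i for i, c in enumerate(letters)}

rng = np.random.default_rng(0)
# placeholder parameters; on the slides these are the keyboard error
# matrix and the English letter statistics shown above
emission = rng.dirichlet(np.ones(26), size=26)     # p(hit key | intended key)
transition = rng.dirichlet(np.ones(26), size=26)   # p(next letter | letter)
initial = np.ones(26) / 26

def score(word, typed):
    # joint probability of an intended word and the observed typed keys
    if len(word) != len(typed):
        return 0.0
    p = initial[L[word[0]]] * emission[L[word[0]], L[typed[0]]]
    for prev, cur, obs in zip(word, word[1:], typed[1:]):
        p *= transition[L[prev], L[cur]] * emission[L[cur], L[obs]]
    return p

typed = 'cwsykcak'
dictionary = ['abandons', 'keyboard', 'mistaken']   # stand-in English word list
best = max(dictionary, key=lambda w: score(w, typed))
print(best)
```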
  • 61. Speech Recognition: raw signal [Figure: raw audio waveform, amplitude plotted against time.]
  • 62. ‘neural’ representation [Figure: time–frequency (‘neural’) representation of the audio signal.]
  • 63. Speech Recognition pho1 pho2 pho3 pho4 aud1 aud2 aud3 aud4 pho: phoneme (letter) aud: audio signal (neural representation)
  • 64. Medical Diagnosis tumour flu meningitis headache fever appetite x-ray Combine known medical knowledge with patient specific information.
  • 65. Table of Contents History of the AI dream How do brains work? Connectionism AutoDiff Fantasy Machines Probability Directed Graphical Models Variational Generative Models Reinforcement Learning Outlook
  • 66. Probability Why Probability? Probability is a logical calculus of uncertainty. Natural framework to use in models of physical systems, such as the Ising Model (1920), and in AI applications, such as the HMM (Baum 1966, Stratonovich 1960). The need for structure We often want to make a probabilistic description of many objects (electron spins, neurons, customers, etc.). Typically the representational and computational cost of probabilistic models grows exponentially with the number of objects represented. Without introducing strong structural limitations about how these objects can interact, probability is a non-starter. For this reason, computationally ‘simpler’ alternatives (such as fuzzy logic) were introduced to try to avoid some of these difficulties – however, these are typically frowned upon by purists.
  • 67. Graphical Models We can use graphs to represent how objects can probabilistically interact with each other. Graphical Models are then a marriage between Graph theory and Probability theory. Many of the quantities that we would like to compute in a probability distribution can then be related to operations on the graph. The computational complexity of operations can often be related to the structure of the graph. Graphical Models are now used as a standard framework in Engineering, Statistics and Computer Science. Graphical Models are used to perform reasoning under uncertainty and are therefore widely applicable.
  • 68. Uses in Industry Microsoft: used to estimate the skill distribution of players in online games (the world’s largest graphical model?!). Hospitals use Belief Nets to encode knowledge about diseases and symptoms to aid medical diagnosis. Google, Microsoft, Facebook: used in many places, including advertising, video game prediction, speech recognition. Used to estimate inherent desirability of products in consumer retail. Microsoft and others: Attempt to go beyond simple A/B testing by using Graphical Models to model the whole company/user relationship.
  • 69. Conditional Probability and Bayes’ Rule The probability of event x conditioned on knowing event y (or more shortly, the probability of x given y) is defined as p(x|y) ≡ p(x, y)/p(y) = p(y|x)p(x)/p(y) (Bayes’ rule) Throwing darts p(region 5|not region 20) = p(region 5, not region 20)/p(not region 20) = p(region 5)/p(not region 20) = (1/20)/(19/20) = 1/19 Interpretation p(A = a|B = b) should not be interpreted as ‘Given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring’. The correct interpretation should be ‘p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b’.
  • 70. Battleships Assume there are 2 ships, 1 vertical (ship 1) and 1 horizontal (ship 2), of 5 pixels each. They can be placed anywhere on the 10×10 grid, but cannot overlap. Let s1 be the origin of ship 1 and s2 the origin of ship 2. Data D is a collection of query ‘hit’ or ‘miss’ responses. p(s1, s2|D) = p(D|s1, s2)p(s1, s2)/p(D) Let X be the matrix of pixel occupancy: p(X|D) = Σ_{s1,s2} p(X, s1, s2|D) = Σ_{s1,s2} p(X|s1, s2)p(s1, s2|D) demoBattleships.m
  • 71. Table of Contents History of the AI dream How do brains work? Connectionism AutoDiff Fantasy Machines Probability Directed Graphical Models Variational Generative Models Reinforcement Learning Outlook
  • 72. Belief Networks (Bayesian Networks) A belief network is a directed acyclic graph in which each node has associated the conditional probability of the node given its parents. The joint distribution is obtained by taking the product of the conditional probabilities: p(A, B, C, D, E) = p(A)p(B)p(C|A, B)p(D|C)p(E|B, C) [Figure: DAG over nodes A, B, C, D, E with edges A→C, B→C, C→D, B→E, C→E.]
  • 73. Example – Part I Sally’s burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes. Choosing an ordering Without loss of generality, we can write p(A, R, E, B) = p(A|R, E, B)p(R, E, B) = p(A|R, E, B)p(R|E, B)p(E, B) = p(A|R, E, B)p(R|E, B)p(E|B)p(B) Assumptions: The alarm is not directly influenced by any report on the radio, p(A|R, E, B) = p(A|E, B) The radio broadcast is not directly influenced by the burglar variable, p(R|E, B) = p(R|E) Burglaries don’t directly ‘cause’ earthquakes, p(E|B) = p(E) Therefore p(A, R, E, B) = p(A|E, B)p(R|E)p(E)p(B)
  • 74. Example – Part II: Specifying the Tables [Figure: belief network with edges B→A, E→A, E→R.] p(A = 1|B, E): B=1, E=1: 0.9999; B=1, E=0: 0.99; B=0, E=1: 0.99; B=0, E=0: 0.0001 p(R = 1|E): E=1: 1; E=0: 0 The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution.
  • 75. Example Part III: Inference Initial Evidence: The alarm is sounding p(B = 1|A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R) = Σ_{E,R} p(A = 1|B = 1, E)p(B = 1)p(E)p(R|E) / Σ_{B,E,R} p(A = 1|B, E)p(B)p(E)p(R|E) ≈ 0.99 Additional Evidence: The radio broadcasts an earthquake warning: A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01. Initially, because the alarm sounds, Sally thinks that she’s been burgled. However, this probability drops dramatically when she hears that there has been an earthquake.
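The same calculation by brute-force enumeration over the tables above (a sketch; the helper name is illustrative):

```python
# enumerate the joint p(A=1, B, E, R) and normalise, using the tables above
p_B = {1: 0.01, 0: 0.99}
p_E = {1: 0.000001, 0: 0.999999}
p_R_given_E = {1: {1: 1.0, 0: 0.0}, 0: {1: 0.0, 0: 1.0}}   # p_R_given_E[E][R]
p_A1_given_BE = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}

def posterior_burglar(observed_R=None):
    num = den = 0.0
    for B in (0, 1):
        for E in (0, 1):
            for R in (0, 1):
                if observed_R is not None and R != observed_R:
                    continue
                joint = p_A1_given_BE[(B, E)] * p_B[B] * p_E[E] * p_R_given_E[E][R]
                den += joint
                if B == 1:
                    num += joint
    return num / den

print(posterior_burglar())              # p(B=1 | A=1)       ~ 0.99
print(posterior_burglar(observed_R=1))  # p(B=1 | A=1, R=1)  ~ 0.01
```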
  • 76. Markov Models For timeseries data v1, . . . , vT , we need a model p(v1:T). For causal consistency, it is meaningful to consider the decomposition p(v1:T) = Π_{t=1}^{T} p(vt|v1:t−1) with the convention p(vt|v1:t−1) = p(v1) for t = 1. [Figure: belief network of this decomposition over v1, . . . , v4.] Independence assumptions It is often natural to assume that the influence of the immediate past is more relevant than the remote past and in Markov models only a limited number of previous observations are required to predict the future.
  • 77. Markov Chain Only the recent past is relevant: p(vt|v1, . . . , vt−1) = p(vt|vt−L, . . . , vt−1) where L ≥ 1 is the order of the Markov chain p(v1:T) = p(v1)p(v2|v1)p(v3|v2) . . . p(vT|vT−1) For a stationary Markov chain the transitions p(vt = s′|vt−1 = s) = f(s′, s) are time-independent (‘homogeneous’). Figure: (a): First order Markov chain. (b): Second order Markov chain.
  • 78. Markov Chains [Figure: chain v1 → v2 → v3 → v4.] p(v1, . . . , vT) = p(v1) Π_{t=2}^{T} p(vt|vt−1), where p(v1) is the initial distribution and p(vt|vt−1) the transition. State transition diagram Nodes represent states of the variable v and arcs non-zero elements of the transition p(vt|vt−1). [Figure: state transition diagram over states 1–9.]
  • 79. Most probable and shortest paths [Figure: state transition diagram over states 1–9.] The shortest (unweighted) path from state 1 to state 7 is 1 − 2 − 7. The most probable path from state 1 to state 7 is 1 − 8 − 9 − 7 (assuming uniform transition probabilities). The latter path is longer but more probable since for the path 1 − 2 − 7, the probability of exiting state 2 into state 7 is 1/5.
  • 80. Equilibrium distribution It is interesting to know how the marginal p(xt) evolves through time: p(xt = i) = Σ_j p(xt = i|xt−1 = j) p(xt−1 = j), with transition matrix M_ij ≡ p(xt = i|xt−1 = j) p(xt = i) is the frequency that we visit state i at time t, given we started from p(x1) and randomly drew samples from the transition p(xτ|xτ−1). As we repeatedly sample a new state from the chain, the distribution at time t, for an initial distribution p1(i), is pt = M^{t−1} p1 If, for t → ∞, p∞ is independent of the initial distribution p1, then p∞ is called the equilibrium distribution of the chain: p∞ = Mp∞ The equilibrium distribution is proportional to the eigenvector with unit eigenvalue of the transition matrix.
  • 81. PageRank Define the matrix A_ij = 1 if website j has a hyperlink to website i, and 0 otherwise. From this we can define a Markov transition matrix with elements M_ij = A_ij / Σ_{i′} A_{i′j} If we jump from website to website, the equilibrium distribution component p∞(i) is the relative number of times we will visit website i. This has a natural interpretation as the ‘importance’ of website i. For each website i a list of words associated with that website is collected. After doing this for all websites, one can make an ‘inverse’ list of which websites contain word w. When a user searches for word w, the list of websites that contain word w is then returned, ranked according to the importance of the site.
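A toy sketch combining the two previous slides: build M from a small link matrix A and find the equilibrium distribution by repeated multiplication (power iteration). The 4-site web below is invented purely for illustration:

```python
import numpy as np

# A[i, j] = 1 if website j links to website i (a small invented web)
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

M = A / A.sum(axis=0)          # M[i, j] = p(next site = i | current site = j)

p = np.ones(4) / 4             # any initial distribution
for _ in range(1000):          # power iteration: p_t = M p_{t-1}
    p = M @ p

print(p)                       # equilibrium distribution = site 'importance'
print(np.argsort(-p))          # sites ranked by importance
```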
  • 82. Hidden Markov Models The HMM defines a Markov chain on hidden (or ‘latent’) variables h1:T. The observed (or ‘visible’) variables are dependent on the hidden variables through an emission p(vt|ht). This defines a joint distribution p(h1:T, v1:T) = p(v1|h1)p(h1) Π_{t=2}^{T} p(vt|ht)p(ht|ht−1) For a stationary HMM the transition p(ht|ht−1) and emission p(vt|ht) distributions are constant through time. Figure: A first order hidden Markov model with ‘hidden’ variables dom(ht) = {1, . . . , H}, t = 1 : T. The ‘visible’ variables vt can be either discrete or continuous.
  • 83. The classical inference problems Filtering (Inferring the present) p(ht|v1:t) Prediction (Inferring the future) p(ht|v1:s), t > s Smoothing (Inferring the past) p(ht|v1:u), t < u Likelihood p(v1:T) Most likely path (Viterbi alignment) argmax_{h1:T} p(h1:T|v1:T) For prediction, one is also often interested in p(vt|v1:s) for t > s.
  • 84. Inference in Hidden Markov Models Belief network representation of a HMM: h1 h2 h3 h4 v1 v2 v3 v4 Filtering, Smoothing and Viterbi are all computationally efficient, scaling linearly with the length of the timeseries (but quadratically with the number of hidden states). The algorithms are variants of ‘message passing on factor graphs’ Algorithm guaranteed to work if the graph is singly-connected. Huge research effort in the last 15 years to apply message passing for approximate inference in multiply-connected graphs (eg low-density parity-check codes).
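A minimal filtering (forward algorithm) sketch for a discrete HMM, with random transition and emission tables standing in for a real model; note the per-step normalisation and the linear-in-T cost:

```python
import numpy as np

def filtering(transition, emission, initial, observations):
    # p(h_t | v_{1:t}) computed recursively, normalising at each step;
    # cost is linear in T and quadratic in the number of hidden states
    alpha = initial * emission[:, observations[0]]
    alpha /= alpha.sum()
    filtered = [alpha]
    for v in observations[1:]:
        alpha = emission[:, v] * (transition @ alpha)
        alpha /= alpha.sum()
        filtered.append(alpha)
    return np.array(filtered)

rng = np.random.default_rng(0)
H, V, T = 3, 4, 6                                  # hidden states, visible states, time steps
transition = rng.dirichlet(np.ones(H), size=H).T   # transition[i, j] = p(h_t = i | h_{t-1} = j)
emission = rng.dirichlet(np.ones(V), size=H)       # emission[i, k] = p(v_t = k | h_t = i)
initial = np.ones(H) / H
obs = rng.integers(0, V, size=T)
print(filtering(transition, emission, initial, obs))
```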
  • 85. HMMs for speech recognition ht is the phoneme at time t. p(ht|ht−1) – language model. p(vt|ht) – speech signal model.
  • 86. Deep Nets and HMMs h1 h2 h3 h4 v1 v2 v3 v4 Recently companies including Google have made big advances in speech recognition. The breakthrough is to model p(vt|ht) as a Gaussian whose mean is some function of the phoneme µ(ht; θ). This function is a deep neural network, trained on a large amount of data. Goldrush at the moment to find similar breakthrough applications of deep networks in reasoning systems.
  • 87. Table of Contents History of the AI dream How do brains work? Connectionism AutoDiff Fantasy Machines Probability Directed Graphical Models Variational Generative Models Reinforcement Learning Outlook
  • 88. Generative Model [Figure: generative model with latent variables h1, h2 as parents of visible variables v1–v4.] It is natural to consider that objects (images for example) can be constructed on the basis of a low dimensional representation. Note that this is a Graphical Model, not a Function The latent variables h can be sampled from, using p(h), and then an image sampled from p(v|h). One cannot use an autoencoder to generate new images. The bad news Inference (computing p(h|v)) and parameter learning are intractable in these models. Statisticians typically use sampling as an approximation. Very popular in ML to use a variational method – much faster for inference.
  • 89. Variational Inference Consider a distribution p(v|θ) = Σ_h p(v|h, θ)p(h) and that we wish to learn θ to maximise the probability this model generates observed data. log p(v|θ) ≥ − Σ_h q(h|v, φ) log q(h|v, φ) + Σ_h q(h|v, φ) log p(v|h, θ) + const. Idea is to choose a ‘variational’ distribution q(h|v, φ) such that we can either calculate analytically the bound, or sample it efficiently. We then jointly maximise the bound wrt φ and θ. We can parameterise p(v|h, θ) using a deep network. Very popular approach – see ‘variational autoencoder’ and also attention mechanisms. Extension to semi-supervised method using p(v) = Σ_h Σ_c p(v|h, c)p(c)p(h)
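A toy sketch of the bound for a small discrete model where everything is tractable, so the bound can be compared with the exact log-likelihood; in the deep models above, q and p are parameterised by networks and the expectations are estimated by sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
H, V = 5, 10
p_h = np.ones(H) / H                             # prior p(h), assumed uniform
p_v_given_h = rng.dirichlet(np.ones(V), size=H)  # p(v | h, theta), a random toy table

v = 3                                            # an observed data point
log_pv = np.log(np.sum(p_v_given_h[:, v] * p_h))   # exact log-likelihood

def bound(q):
    # variational lower bound: E_q[log p(v|h) p(h)] - E_q[log q(h)]
    return np.sum(q * (np.log(p_v_given_h[:, v] * p_h) - np.log(q)))

q_uniform = np.ones(H) / H
q_exact = p_v_given_h[:, v] * p_h
q_exact /= q_exact.sum()                         # the true posterior p(h|v)

print(log_pv, bound(q_uniform), bound(q_exact))
# the bound sits below log p(v) and becomes tight when q equals the posterior
```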
  • 90. DRAW
  • 91. Table of Contents History of the AI dream How do brains work? Connectionism AutoDiff Fantasy Machines Probability Directed Graphical Models Variational Generative Models Reinforcement Learning Outlook
  • 92. Reinforcement Learning Can we teach computers to play Atari video games?
  • 93. Deep Reinforcement Learning Given a state of the world W and a set of possible actions A, we need to decide which action to take for any state of W that will be best for our long term goals. Problem is that the number of pixel states is enormous. Need to learn a low dimensional representation of the screen (use a deep generative model). Then learn which action to take given the low dimensional representation. [Images: Tetris; Google]
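The slides do not name the learning algorithm; as an illustration of ‘learning which action to take’, here is a minimal tabular Q-learning sketch on an invented toy environment (the Atari work replaces the table with a deep network over the learned screen representation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 4      # 'states' stand in for a learned low-dim representation
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.95, 0.1

def step(state, action):
    # placeholder environment: deterministic moves, reward only in state 9
    next_state = (state + action + 1) % n_states
    reward = 1.0 if next_state == 9 else 0.0
    return next_state, reward

state = 0
for t in range(10000):
    # epsilon-greedy action choice
    if rng.random() < epsilon:
        action = int(rng.integers(n_actions))
    else:
        action = int(np.argmax(Q[state]))
    next_state, reward = step(state, action)
    # Q-learning update towards the one-step lookahead target
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print(Q.round(2))   # learned long-term value of each action in each state
```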
  • 94. Table of Contents History of the AI dream How do brains work? Connectionism AutoDiff Fantasy Machines Probability Directed Graphical Models Variational Generative Models Reinforcement Learning Outlook
  • 95. Outlook Machine Learning is in a boom period. Renewed interest and hope in creating AI. Combine new computational power with suitable hierarchical representations. Impressive state of the art results in Speech Recognition, Image Analysis, Game Playing. Challenges Improve understanding of optimisation for deep learning. Learn how to more efficiently exploit computational resources. Learn how to exploit massive databases. Improve interaction between reinforcement learning and representation learning. Marry non-symbolic (neural) with symbolic (Bayesian reasoning) approaches. Emphasis is on scalability. Feel free to contact me at UCL or at my AI company reinfer https://reinfer.io