5. Some Limitations
● Domain adaptation
● Performance depends on the number of classes
● Complexity grows with the number of classes
● Hard to extend.
6. Example: Tracking
● A target is identified in a video; we want the system to follow its location and pose over time.
● Template based
– Template drift problem
– Template update strategies
● … We're pretty good at it now.
Videos from the ALIEN tracker:
Z. Kalal, K. Mikolajczyk and J. Matas, “Tracking-Learning-Detection,” IEEE TPAMI, 2011.
F. Pernici, “FaceHugger: The ALIEN Tracker Applied to Faces,” ECCV 2012.
7. Robot Vision?
● Navigation (path planning, obstacle avoidance, SLAM)
● Grasping, manipulation, tool use.
● Planning (not strictly vision, but connected)
● Human-robot interaction?
● Mostly a strong need for precise 3D estimates of the world and objects' shapes.
[Image: NAO robot (Aldebaran Robotics)]
8. Robot Vision: Grasping?
● Grasping remains a challenging task.
● Five-finger hands are complex to control.
● Choosing (stable) points of contact for the fingers depends on texture, the object's 3D shape and weight...
● Precise 3D shape and 6D pose estimation, motion planning, obstacle detection...
● Hard to estimate from vision...
R. Detry, C. H. Ek, M. Madry, J. Piater and D. Kragic, “Generalizing Grasps Across Partly Similar Objects,” IEEE ICRA 2012.
9. Robot Vision: Affordances ??
● James J. Gibson, The Theory of Affordances (1977)
● Latent “action possibilities” connected to objects.
● Affordance generalisation across object classes...
● Neural evidence: mirror neurons (Rizzolatti, G. and Craighero, L., “The mirror-neuron system,” Annual Review of Neuroscience 27, 169–192, 2004)
10. Robot Vision: Tool use ???
● Using tools for solving tasks is still a challenge – especially learning to!
● Primates (and even some birds) can do it (Wolfgang Köhler, The Mentality of Apes, 1925).
13. So... what is vision?
● Loosely defined concept
● Pretty much, vision is what we experience on a daily basis
● A rich, vivid and complete representation of the world...
● … except most of it is made up...
14. The truth about human vision
● Human eye:
– high resolution only in a small, central area called the fovea (cones).
– colour vision only in the fovea (cones).
– very coarse resolution elsewhere.
– high sensitivity to low light and motion in the periphery (rods).
– we're virtually blind to static areas.
– Ah... and we have a significant blind spot in each eye's field of view.
– … never noticed all that?
15. Human vision: the dualist illusion
● Our intuition is similar to Descartes' view
● “The Cartesian theatre”
● We now know (from neuroscience) that this is not the case.
● There is no clear delineation in the brain between perception and cognition.
[Diagram from Descartes' “Meditations”: vision module → cognition/consciousness → action module]
16. Vision in the brain
Figure 25-12 from E.R. Kandel, J.H. Schwartz and T.M. Jessell, Eds., Principles of Neural Science, 4th Edition.
17. Cognitive Vision
● The idealised view of vision as a separate module feeding information to cognition does not work.
● So, where do we put the bar?
[Diagram: low-level signal processing ↔ cognitive vision (with feedback) ↔ high-level cognition, consciousness]
18. Today's roadmap
● A (non-)definition of cognitive vision and its flavours
– Cognitivist/symbolic AI approach and its problems
● The frame problem
● The symbol grounding problem
– The emergent view
● Aside: neural networks
– The embodiment question
● How to get there? Some insights from representation learning and deep architectures.
– Autoencoders
– Convolutional networks
19. What is Cognitive Vision?
● H.H. Nagel (2003):
– improving computer vision algorithms by adding numerous consistency-check mechanisms, at a logical level.
● David Vernon (2008, first draft 2004):
– “... attempt to achieve more robust, resilient and adaptable computer vision systems by endowing them with cognitive capabilities”
– “... able to adapt to unforeseen changes in the visual environment”
– “... in essence, a combination of computer vision and cognition”
● Multiple approaches to Cog-V
– Symbolic AI
– Emergent view
– Embodied AI
[Diagram: Cognitive vision? — symbolic AI (dualist), emergent, embodied]
H.H. Nagel, “Reflections on cognitive vision systems,” in proc. of ICVS 2003.
D. Vernon, “Cognitive Vision: The case for an embodied perception,” Image and Vision Computing 26 (2008).
20. Example of Cognitive Architecture
The KnowRob system
M. Tenorth and M. Beetz, “KnowRob – A Knowledge Processing Infrastructure for Cognition-enabled Robots. Part 1: The KnowRob System,” IJRR 2013.
21. Symbolic AI
● Cognition involves operations over symbolic representations.
● “Perception” is the process of abstracting symbolic representations from sensory signals.
● Mostly, the symbolic representation is the product of human design and choice.
● → problem when we go away from the domain of human experience (i.e., the “semantic gap”)
[Diagram: sensory signals → interpretation → symbolic representation → logical reasoning]
22. The symbol grounding problem
Searle's “Chinese room argument” (1980):
– The symbols do not have the same semantics attached to them as for the designer...
Harnad (1990):
– Cognition is more than symbol manipulation
→ In other words, the system should learn its own symbols, grounded in its own experiences...
Barsalou (1999):
– Cognition is inherently perceptual
– (and therefore, perception is inherently cognitive)
23. The frame problem in AI – part I
(Daniel C. Dennett)
● Once upon a time, there was a robot, called R1...
D.C. Dennett, “Cognitive Wheels: The Frame Problem of AI,” in C. Hookway, ed., Minds, Machines and Evolution, Cambridge University Press, 1984, pp. 129-151.
[Cartoon: R1 is given the task PULL WAGON.]
25. The frame problem in AI – part II
(Daniel C. Dennett)
● A new robot was built to recognise, and handle, side-effects: R1D1
[Cartoon, R1D1 deliberating: “Pull the wagon?” “Pulling the wagon does not change the wall colour.” “Pulling the wagon does not discharge the batteries.” ...]
26. The frame problem in AI – part III
(Daniel C. Dennett)
● The designers built a third robot to assess the relevance of implications: say hello to R2D1.
27. The frame problem in AI – part III
(Daniel C. Dennett)
● In sum, any action requires a large, a priori unknown, amount of world knowledge
● Hard to predict for the system designer
● Hard to deduce symbolically by the system
● → need for common-sense associations
For vision: it is hard to predetermine a priori the features and detectors that will be required.
28. Issues with Symbolic AI
● Symbolic AI is an efficient architecture
● Has successfully solved some hard problems
● ...but faces some complex limitations due to the separation between symbolic and sub-symbolic components.
[Diagram: computer vision (low-level signal processing, detectors) → symbols → AI (symbolic reasoning, high-level cognition, consciousness)]
29. Emergent Cognition
● The system develops its own epistemology (set of symbols & associations) from interacting with its environment.
● Enactive view (H. Maturana and F. Varela, The Tree of Knowledge – The Biological Roots of Human Understanding, New Science Library, Boston & London, 1987):
– autonomous system
– can affect the environment
– is affected by the environment (embodied)
– self-organised and self-generated.
● Central nervous system
– prediction & adaptation
Fig. from Vernon, von Hofsten & Fadiga, A Roadmap for Cognitive Development in Humanoid Robots, Springer, 2010.
30. Emergent Cognition: Shared Epistemology
● Problem: different experience → different symbols!
● A shared epistemology comes from communication between agents (my concept of “red” and yours are shared, even if you're colour blind)
● Note: communication between artificial systems can be a lot faster!
31. Artificial Neural Networks
● An “artificial neuron” is in effect
– a linear transformation, followed by
– a non-linear squashing function s (sketched in code below)

$a_n = f_{w,b}(x) = s\left(\sum_i w_i x_i + b\right)$

[Diagram: inputs x1, x2, x3 and bias input +1 feed node n through weights w1, w2, w3 and bias b, producing activation a_n]
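To make this concrete, here is a minimal sketch in Python/NumPy (not part of the original slides; the names are illustrative) of a single neuron with a sigmoid squashing function:

    import numpy as np

    def sigmoid(z):
        # squashing function s(z) = 1 / (1 + exp(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def neuron(x, w, b):
        # a_n = s(sum_i w_i x_i + b)
        return sigmoid(np.dot(w, x) + b)

    # three inputs, as in the diagram
    print(neuron(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, -0.3]), 0.2))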
33. Artificial Neural Network
(aka Multilayer perceptron)
[Diagram: input layer (layer #1, N¹ = 3 inputs x1, x2, x3), “hidden” layer (layer #2, N² = 2 nodes h1, h2), output layer (layer #3, N³ = 1 node r1), plus bias units +1]

Parameters: $\theta = (W^1, b^1, W^2, b^2)$

$f_\theta(x) = s\left(\sum_{j \in [1,N^2]} W^2_{j1}\, s\left(\sum_{i \in [1,N^1]} W^1_{ij} x_i + b^1_j\right) + b^2_1\right)$

Generic node activation:
$z^{l+1}_i = \sum_{j \in [1,N^l]} W^l_{ji} a^l_j + b^l_i, \qquad a^l_i = s(z^l_i)$
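As a sketch (assuming sigmoid activations; not from the slides), the 3-2-1 network above amounts to two matrix-vector products, each followed by the squashing function:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def mlp_forward(x, W1, b1, W2, b2):
        # a^2 = s(W^1 x + b^1); here W^l is stored as an (N^{l+1} x N^l)
        # matrix, i.e. the transpose of the slide's W^l_{ij} indexing
        a2 = sigmoid(W1 @ x + b1)
        # a^3 = s(W^2 a^2 + b^2)
        return sigmoid(W2 @ a2 + b2)

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(2, 3)), np.zeros(2)   # layer #1 -> #2
    W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)   # layer #2 -> #3
    print(mlp_forward(np.array([1.0, 0.5, -0.5]), W1, b1, W2, b2))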
34. Learning by back-propagation
For a given datapoint $x$ with label $y$, we have an error for the network:

$E = \frac{1}{2}\|a^3 - y\|^2$

TOP LAYER ERROR:
$\delta^L_j = \frac{\partial E}{\partial a^L_j}\, s'(z^L_j) \quad \left(\Leftrightarrow\; \delta^L_j = (a^L_j - y_j)\, s'(z^L_j)\right)$

OTHER LAYERS ERROR:
$\delta^l_j = \sum_i W^l_{ji}\, \delta^{l+1}_i\, s'(z^l_j)$

[Diagram: the 3-2-1 network again, with error $\delta^3_1$ at the output node and $\delta^2_1$ at the hidden layer]
35. Learning by back-propagation
Finally, we get the error derivative for all network parameters:

$\frac{\partial E}{\partial W^l_{ij}} = a^l_i\, \delta^{l+1}_j \qquad \frac{\partial E}{\partial b^l_j} = \delta^{l+1}_j$

→ Update the parameters with gradient descent.

[Diagram: the 3-2-1 network again, annotated with $\delta^3_1$ and $\delta^2_1$]
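Putting slides 34-35 together, one gradient step for the 3-2-1 network might look as follows (a minimal sketch assuming sigmoid units and squared error; not the original slides' code):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_step(x, y, W1, b1, W2, b2, lr=0.1):
        # forward pass, keeping activations for the deltas
        a2 = sigmoid(W1 @ x + b1)
        a3 = sigmoid(W2 @ a2 + b2)
        sp2, sp3 = a2 * (1 - a2), a3 * (1 - a3)    # s'(z) for the sigmoid
        # top-layer error: delta^3 = (a^3 - y) s'(z^3)
        d3 = (a3 - y) * sp3
        # other layers: delta^2 = (W^2)^T delta^3 * s'(z^2)
        d2 = (W2.T @ d3) * sp2
        # dE/dW^l_{ij} = a^l_i delta^{l+1}_j, dE/db^l_j = delta^{l+1}_j
        W2 -= lr * np.outer(d3, a2); b2 -= lr * d3
        W1 -= lr * np.outer(d2, x);  b1 -= lr * d2
        return 0.5 * np.sum((a3 - y) ** 2)          # the error E

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(2, 3)), np.zeros(2)
    W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)
    for _ in range(100):
        E = backprop_step(np.array([1.0, 0.5, -0.5]), np.array([1.0]), W1, b1, W2, b2)
    print(E)   # the error shrinks as training proceeds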
36. Embodiment
● Idea: concepts can only be learnt for and by a body
– → being affected by the environment
– actions and perception are learnt jointly.
– good perception is what allows successful actions.
● Example of reaching with a neural network (L. Jamone, L. Natale, G. Metta, F. Nori and G. Sandini, “Autonomous Online Learning of Reaching Behavior in a Humanoid Robot,” International Journal of Humanoid Robotics 9(3), 2012.)
37. Do we need embodiment?
● If you buy the emergent thesis, it is required
– joint development of perception & action
– symbol grounding in experience
– → emergent epistemology
● What type of embodiment?
– strong: physical body (or even organic body!)
– weak: a system coupled with its environment
● it can affect its environment, and
● it is affected by it
38. Phylogeny vs. Ontogeny
● Phylogeny: the system's design (e.g. features like SIFT or lines). High with the cognitivist approach, more limited in the emergent paradigm.
● Ontogeny: the system's development during its lifetime, drawn from experiences with its environment.
● Challenges for artificial systems:
– hard to learn high-level, abstract symbols autonomously.
– hard to generalise across experiences
– → how to learn abstract representations from experience?
39. Representation Learning
● Simple example: PCA (sketched in code below)
● Aim: identify dimensions that vary jointly
● Components are the axes of largest variation.
● Linear transformation
● Orthogonal basis
● Applied to natural images, generates filters similar to early cortical cells (V1)

$y = W^T x + \mu$

P.J.B. Hancock, R.J. Baddeley and L.S. Smith (1992), “The principal components of natural images,” Network: Computation in Neural Systems 3(1).
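A minimal NumPy sketch of PCA via the SVD (illustrative, not the slides' code; here the data are centred first, so the projection reads $y = W^T(x - \mu)$):

    import numpy as np

    def pca(X, k):
        # rows of X are data points
        mu = X.mean(axis=0)
        # right singular vectors of the centred data are the
        # axes of largest variation (an orthogonal basis)
        _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
        return mu, Vt[:k].T      # one component per column of W

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    mu, W = pca(X, 2)
    Y = (X - mu) @ W             # linear transformation of the data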
40. Arguments for deep hierarchies
● Feature sharing at intermediate levels → sub-linear coding and computation requirements (Fidler, Boben & Leonardis, “Evaluating multi-class learning strategies in a generative hierarchical framework for object detection,” NIPS 2009.)
● Compact coding (Bengio, Courville & Vincent, “Representation Learning: A Review and New Perspectives,” IEEE PAMI 35(8), 2013.)
● → The human visual system is estimated to have 5-10 levels (Krueger et al., “Deep Hierarchies in the Primate Visual Cortex: What Can We Learn for Computer Vision?”, 2013)
● → NN, CART, SVM: essentially 2-layer architectures
Figure from Fidler, Boben & Leonardis 2009
41. Arguments for Deep Hierarchies
● Problem with linear representations:
– A combination of any number of linear representations is also a linear representation... (checked numerically below)

$y = W_1^T x + \mu_1$
$z = W_2^T y + \mu_2$
$\Leftrightarrow\; z = W_2^T W_1^T x + W_2^T \mu_1 + \mu_2$
$\Leftrightarrow\; z = W_3^T x + \mu_3$

[Diagram: x → y via (W1, μ1), y → z via (W2, μ2); equivalently x → z via (W3, μ3)]
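A quick numerical check of this collapse (an illustrative sketch, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    W1, mu1 = rng.normal(size=(5, 4)), rng.normal(size=4)
    W2, mu2 = rng.normal(size=(4, 3)), rng.normal(size=3)
    x = rng.normal(size=5)

    # two stacked linear maps...
    z_stacked = W2.T @ (W1.T @ x + mu1) + mu2
    # ...equal a single linear map with W3 = W1 W2, mu3 = W2^T mu1 + mu2
    z_single = (W1 @ W2).T @ x + (W2.T @ mu1 + mu2)
    print(np.allclose(z_stacked, z_single))   # True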
42. Data-driven hierarchies: Autoencoders
● Idea: learn jointly a pair of mappings $y = \phi(x)$ and $z = \psi(y)$
● that minimises information loss:

$\underset{\phi,\psi}{\operatorname{argmin}} \sum_x \|x - \psi(\phi(x))\|_D$

● often using a neural network formulation (sketched below):

$\phi(x) = s(W x + b)$
$\psi(y) = W' y + b'$

[Diagram: inputs x1..x4 encoded by φ into hidden units y1..y3, decoded back by ψ]
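The encoder/decoder pair and the reconstruction loss, as a NumPy sketch (illustrative names; the training itself, gradient descent on this loss, is omitted here):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def phi(x, W, b):
        # encoder: y = s(W x + b)
        return sigmoid(W @ x + b)

    def psi(y, Wp, bp):
        # decoder: z = W' y + b'
        return Wp @ y + bp

    def reconstruction_loss(X, W, b, Wp, bp):
        # sum_x || x - psi(phi(x)) ||^2
        return sum(np.sum((x - psi(phi(x, W, b), Wp, bp)) ** 2) for x in X)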
43. Remember: ANNs
● An “artificial neuron” is in effect a linear transformation followed by a non-linear squashing function s (recap of slide 31): $a_n = f_{w,b}(x) = s\left(\sum_i w_i x_i + b\right)$
45. Data-driven hierarchies: Sparse Autoencoders
● Trivial solution when dim(Y) ≥ dim(X)!
● But overcomplete bases can be beneficial (Olshausen & Field 1996)
● Solution → sparse coding (penalty sketched below):

$\underset{\phi,\psi}{\operatorname{argmin}} \sum_x \|x - \psi(\phi(x))\|_D + g(\phi(x))$

[Diagram: x encoded to y by φ, decoded by ψ]
Olshausen, B. and Field, D. (1996). “Emergence of simple-cell receptive field properties by learning a sparse code for natural images.” Nature, 381(6583):607–609.
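Reusing the phi/psi sketch from slide 42, the sparse variant just adds a penalty g on the code; an L1 term is one common choice (an assumption for illustration, since the slide leaves g unspecified):

    def sparse_loss(X, W, b, Wp, bp, lam=0.1):
        # reconstruction error plus an L1 penalty on the codes,
        # which pushes most hidden activations towards zero
        loss = 0.0
        for x in X:
            y = phi(x, W, b)
            loss += np.sum((x - psi(y, Wp, bp)) ** 2) + lam * np.sum(np.abs(y))
        return loss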
46. Stacked Autoencoders
● You can stack multiple layers of AEs (see the sketch below)
● Trained layer-wise
● Note that the structure is the same as an ANN.
● Can be fine-tuned with backpropagation
[Diagram: x → h via (φ1, ψ1), then h → y via (φ2, ψ2)]
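Greedy layer-wise training, sketched in NumPy (a toy trainer, only to show the stacking pattern; all names and hyperparameters are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_autoencoder(X, hidden_dim, lr=0.1, epochs=50):
        # minimal AE trained by gradient descent on the squared
        # reconstruction error; returns the learnt encoder phi
        rng = np.random.default_rng(0)
        d = X.shape[1]
        W  = rng.normal(scale=0.1, size=(hidden_dim, d)); b  = np.zeros(hidden_dim)
        Wp = rng.normal(scale=0.1, size=(d, hidden_dim)); bp = np.zeros(d)
        for _ in range(epochs):
            for x in X:
                y = sigmoid(W @ x + b)
                e = (Wp @ y + bp) - x             # reconstruction error
                dy = (Wp.T @ e) * y * (1 - y)     # backprop through the code
                Wp -= lr * np.outer(e, y); bp -= lr * e
                W  -= lr * np.outer(dy, x); b  -= lr * dy
        return lambda x: sigmoid(W @ x + b)

    def train_stacked(X, layer_dims):
        # each AE learns to reconstruct the codes of the layer below
        encoders = []
        for dim in layer_dims:
            enc = train_autoencoder(X, dim)
            encoders.append(enc)
            X = np.array([enc(x) for x in X])     # feed codes upwards
        return encoders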
47. Limitations of ANN
● Problem with ANNs: they don't work well with more than 2 layers
● Problem with backprop: the gradient signal probably gets too diluted across layers (vanishing gradients).
● Problem for emergent cognition: we want to learn higher levels of abstraction!
● More recently, several alternatives have been developed (deep learning): Restricted Boltzmann Machines (RBM), stacked autoencoders, convolutional nets.
48. Today's hot topic:
Convolutional Neural Nets
● CNNs are neural nets (of course)
● sparse connectivity
● shared weights → convolutional (sketched below)
● receptive fields span all input dimensions
● typically alternating layers of convolution and max-pooling
[Diagram: sparse, weight-shared connections from input units to hidden units]
Fig. from http://deeplearning.net/tutorial/lenet.html
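The two building blocks, as a small NumPy sketch (strictly speaking this is cross-correlation, as in most CNN code; illustrative only):

    import numpy as np

    def conv2d(img, kernel):
        # 'valid' convolution: the same shared weights slide over
        # every location -- sparse connectivity + weight sharing
        kh, kw = kernel.shape
        H, W = img.shape
        out = np.empty((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
        return out

    def max_pool(fmap, size=2):
        # non-overlapping max-pooling: keep the strongest response
        # in each size x size block
        H2, W2 = fmap.shape[0] // size, fmap.shape[1] // size
        return fmap[:H2*size, :W2*size].reshape(H2, size, W2, size).max(axis=(1, 3))

    img = np.random.default_rng(0).normal(size=(8, 8))
    edge = np.array([[1.0, -1.0], [1.0, -1.0]])    # toy edge filter
    print(max_pool(conv2d(img, edge)).shape)       # (3, 3)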
49. CNNs (cont'd)
● Ex: LeNet (LeCun et al., 1998)
● Alternating convolution & subsampling (i.e., max-pooling) layers
● The top layer is a typical ANN.
● Trained using backprop & stochastic gradient descent
Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, 1998.
Figure from http://deeplearning.net/tutorial/lenet.html
50. CNNs (cont'd)
● Problem: training deep networks is difficult with backprop (slow, requires LOTS of data)
● CNNs do better (because of their sparse connectivity) but it is still a problem.
● → Unsupervised pre-training of the network
– using, e.g., sparse autoencoders, layer-wise.
– refine the weights with supervised backprop afterwards.
● Top results on MNIST, ILSVRC, PASCAL VOC.
51. Other Part-based Hierarchies
● Deep belief networks (DBN): Restricted Boltzmann Machines (Hinton, Osindero and Teh, “A Fast Learning Algorithm for Deep Belief Nets,” Neural Computation 18, 2006.)
● Slow Feature Analysis (SFA) (Franzius, Wilbert and Wiskott, “Invariant object recognition and pose estimation with slow feature analysis,” Neural Computation, 2011)
● Compositional hierarchies (Fidler, Boben & Leonardis, “Evaluating multi-class learning strategies in a generative hierarchical framework for object detection,” NIPS 2009)
● → Good review by Yoshua Bengio: Y. Bengio, A. Courville and P. Vincent, “Representation Learning: A Review and New Perspectives,” IEEE PAMI 35(8), 2013.
Fidler, S., M. Boben and A. Leonardis (2009), “Learning hierarchical compositional representations of object structure,” pp. 196-215 in Object Categorization: Computer and Human Vision Perspectives, edited by Sven J. Dickinson, Aleš Leonardis, Bernt Schiele and Michael J. Tarr. New York: Cambridge University Press.
52. Summary and conclusions
● There is no clear delineation between cognition and vision.
● Reasoning on hand-crafted symbols may be inadequate (semantic gap) or brittle.
● Learning abstraction is hard, but possible using deep hierarchies.
● Unsupervised pre-training for deep hierarchies is critical → this tells us something about cognition.