5. Some Limitations
● Domain adaptation
● Performance depends on the number of classes
● Complexity grows with the number of classes
● Hard to extend.
6. Example: Tracking
● A target is identified in a video; we want the system to follow its location and pose over time.
● Template based
– Template drift problem
– Template update strategies
● … We're pretty good at it now.
Videos from the ALIEN tracker:
Z. Kalal, K. Mikolajczyk and J. Matas, “Tracking-Learning-Detection,” IEEE TPAMI, 2011.
F. Pernici, “FaceHugger: The ALIEN Tracker Applied to Faces,” ECCV 2012.
7. Robot Vision?
● Navigation (path planning, obstacle avoidance, SLAM)
● Grasping, manipulation, tool use.
● Planning (not strictly vision, but connected)
● Human-robot interaction?
● Mostly a strong need for precise 3D estimates of the world and objects' shapes.
[Image: NAO robot (Aldebaran Robotics)]
8. Robot Vision: Grasping?
● Grasping remains a challenging task.
● Five-finger hands are complex to control.
● Choosing (stable) points of contact for the fingers depends on texture, the object's 3D shape and weight...
● Precise 3D shape and 6D pose estimation, motion planning, obstacle detection...
● Hard to estimate from vision...
R. Detry, C. H. Ek, M. Madry, J. Piater and D. Kragic, “Generalizing Grasps Across Partly Similar Objects,” IEEE ICRA 2012.
9. Robot Vision: Affordances ??
● James J. Gibson, The Theory of Affordances (1977)
● Latent “action possibilities” connected to objects.
● Affordance generalisation across object classes...
● Neural evidence: mirror neurons (Rizzolatti, G. and Craighero, L., “The mirror-neuron system,” Annual Review of Neuroscience 27, 169–192, 2004)
10. Robot Vision: Tool use ???
● Using tools for solving tasks is still a challenge – especially learning to!
● Primates (and even some birds) can do it (Wolfgang Köhler, The Mentality of Apes, 1925).
13. So... what is vision?
● Loosely defined concept
● Pretty much, vision is what we experience on a daily basis
● A rich, vivid and complete representation of the world...
● … except most of it is made up...
14. The truth about human vision
● Human eye:
– high resolution only in a small, central area called the fovea (cones).
– colour vision only in the fovea (cones).
– very coarse resolution elsewhere.
– high sensitivity to low light and motion in the periphery (rods).
– we're virtually blind to static areas.
– Ah... and we have a significant blind spot in each eye's field of view.
– … never noticed all that?
15. Human vision: the dualist illusion
● Our intuition is similar to Descartes' view
● “The Cartesian theatre”
● We now know (from neuroscience) that this is not the case.
● There is no clear delineation in the brain between perception and cognition.
[Diagram from Descartes' “Meditations”: vision module → cognition/consciousness → action module]
16. Vision in the brain
Figure 25-12 from E.R. Kandel, J.H. Schwartz and T.M. Jessell, Eds., Principles of Neural Science, 4th Edition.
17. Cognitive Vision
● The idealised view of vision as a separate module feeding information to cognition does not work.
● So, where do we put the bar?
[Diagram: low-level signal processing ↔ cognitive vision (with feedback) ↔ high-level cognition, consciousness]
18. Today's roadmap
● A (non-)definition of cognitive vision and its flavours
– Cognitivist/symbolic AI approach and its problems
● The frame problem
● The symbol grounding problem
– The emergent view
● Aside: neural networks
– The embodiment question
● How to get there? Some insights from representation learning and deep architectures.
– Autoencoders
– Convolutional networks
19. What is Cognitive Vision?
● H.H. Nagel (2003):
– improving computer vision algorithms by adding numerous consistency-check mechanisms, at a logical level.
● David Vernon (2008, first draft 2004):
– “... attempt to achieve more robust, resilient and adaptable computer vision systems by endowing them with cognitive capabilities”
– “... able to adapt to unforeseen changes in the visual environment”
– “... in essence, a combination of computer vision and cognition”
● Multiple approaches to Cog-V
– Symbolic AI
– Emergent view
– Embodied AI
[Diagram: Cognitive vision? — symbolic AI (dualist), emergent, embodied]
H.H. Nagel, “Reflections on cognitive vision systems,” in proc. of ICVS 2003.
D. Vernon, “Cognitive Vision: The case for an embodied perception,” Image and Vision Computing 26 (2008).
20. Example of Cognitive Architecture
The KnowRob system
M. Tenorth and M. Beetz, “KnowRob – A Knowledge Processing Infrastructure for Cognition-enabled Robots. Part 1: The KnowRob System,” IJRR 2013.
21. Symbolic AI
● Cognition involves operations over symbolic representations.
● “Perception” is the process of abstracting symbolic representations from sensory signals.
● Mostly, the symbolic representation is the product of human design and choice.
● → problem when we go away from the domain of human experience (i.e., the “semantic gap”)
[Diagram: sensory signals → interpretation → symbolic representation → logical reasoning]
22. The symbol grounding problem
Searle's “Chinese room argument” (1980):
– The symbols do not have the same semantics attached to them as for the designer...
Harnad (1990):
– Cognition is more than symbol manipulation
→ In other words, the system should learn its own symbols, grounded in its own experiences...
Barsalou (1999):
– Cognition is inherently perceptual
– (and therefore, perception is inherently cognitive)
23. The frame problem in AI – part I
(Daniel C. Dennett)
● Once upon a time, there was a robot, called R1...
D.C. Dennett, “Cognitive Wheels: The Frame Problem of AI,” in C. Hookway, ed., Minds, Machines and Evolution, Cambridge University Press, 1984, pp. 129-151.
[Cartoon: R1 is given the task PULL WAGON.]
25. The frame problem in AI – part II
(Daniel C. Dennett)
● A new robot was built to recognise, and handle, side-effects: R1D1
[Cartoon, R1D1 deliberating: “Pull the wagon?” “Pulling the wagon does not change the wall colour.” “Pulling the wagon does not discharge the batteries.” ...]
26. The frame problem in AI – part III
(Daniel C. Dennett)
● The designers built a third robot to assess the relevance of implications: say hello to R2D1.
27. The frame problem in AI – part III
(Daniel C. Dennett)
● In sum, any action requires a large, a priori unknown, amount of world knowledge
● Hard to predict for the system designer
● Hard to deduce symbolically by the system
● → need for common-sense associations
For vision: it is hard to predetermine a priori the features and detectors that will be required.
28. Issues with Symbolic AI
● Symbolic AI is an efficient architecture
● Has successfully solved some hard problems
● ...but faces some complex limitations due to the separation between symbolic and sub-symbolic components.
[Diagram: computer vision (low-level signal processing, detectors) → symbols → AI (symbolic reasoning, high-level cognition, consciousness)]
29. Emergent Cognition
● The system develops its own epistemology (set of symbols & associations) from interacting with its environment.
● Enactive view (H. Maturana and F. Varela, The Tree of Knowledge – The Biological Roots of Human Understanding, New Science Library, Boston & London, 1987):
– autonomous system
– can affect the environment
– is affected by the environment (embodied)
– self-organised and self-generated.
● Central nervous system
– prediction & adaptation
Fig. from Vernon, von Hofsten & Fadiga, A Roadmap for Cognitive Development in Humanoid Robots, Springer, 2010.
30. Emergent Cognition: Shared Epistemology
● Problem: different experience → different symbols!
● A shared epistemology comes from communication between agents (my concept of “red” and yours are shared, even if you're colour blind)
● Note: communication between artificial systems can be a lot faster!
31. Artificial Neural Networks
● An “artificial neuron” is in effect
– a linear transformation, followed by
– a non-linear squashing function s (sketched in code below)

$a_n = f_{w,b}(x) = s\left(\sum_i w_i x_i + b\right)$

[Diagram: inputs x1, x2, x3 and bias input +1 feed node n through weights w1, w2, w3 and bias b, producing activation a_n]
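To make this concrete, here is a minimal sketch in Python/NumPy (not part of the original slides; the names are illustrative) of a single neuron with a sigmoid squashing function:

    import numpy as np

    def sigmoid(z):
        # squashing function s(z) = 1 / (1 + exp(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def neuron(x, w, b):
        # a_n = s(sum_i w_i x_i + b)
        return sigmoid(np.dot(w, x) + b)

    # three inputs, as in the diagram
    print(neuron(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, -0.3]), 0.2))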
33. Artificial Neural Network
(aka Multilayer perceptron)
[Diagram: input layer (layer #1, N¹ = 3 inputs x1, x2, x3), “hidden” layer (layer #2, N² = 2 nodes h1, h2), output layer (layer #3, N³ = 1 node r1), plus bias units +1]

Parameters: $\theta = (W^1, b^1, W^2, b^2)$

$f_\theta(x) = s\left(\sum_{j \in [1,N^2]} W^2_{j1}\, s\left(\sum_{i \in [1,N^1]} W^1_{ij} x_i + b^1_j\right) + b^2_1\right)$

Generic node activation:
$z^{l+1}_i = \sum_{j \in [1,N^l]} W^l_{ji} a^l_j + b^l_i, \qquad a^l_i = s(z^l_i)$
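As a sketch (assuming sigmoid activations; not from the slides), the 3-2-1 network above amounts to two matrix-vector products, each followed by the squashing function:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def mlp_forward(x, W1, b1, W2, b2):
        # a^2 = s(W^1 x + b^1); here W^l is stored as an (N^{l+1} x N^l)
        # matrix, i.e. the transpose of the slide's W^l_{ij} indexing
        a2 = sigmoid(W1 @ x + b1)
        # a^3 = s(W^2 a^2 + b^2)
        return sigmoid(W2 @ a2 + b2)

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(2, 3)), np.zeros(2)   # layer #1 -> #2
    W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)   # layer #2 -> #3
    print(mlp_forward(np.array([1.0, 0.5, -0.5]), W1, b1, W2, b2))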
34. Learning by back-propagation
For a given datapoint $x$ with label $y$, we have an error for the network:

$E = \frac{1}{2}\|a^3 - y\|^2$

TOP LAYER ERROR:
$\delta^L_j = \frac{\partial E}{\partial a^L_j}\, s'(z^L_j) \quad \left(\Leftrightarrow\; \delta^L_j = (a^L_j - y_j)\, s'(z^L_j)\right)$

OTHER LAYERS ERROR:
$\delta^l_j = \sum_i W^l_{ji}\, \delta^{l+1}_i\, s'(z^l_j)$

[Diagram: the 3-2-1 network again, with error $\delta^3_1$ at the output node and $\delta^2_1$ at the hidden layer]
35. Learning by back-propagation
Finally, we get the error derivative for all network parameters:

$\frac{\partial E}{\partial W^l_{ij}} = a^l_i\, \delta^{l+1}_j \qquad \frac{\partial E}{\partial b^l_j} = \delta^{l+1}_j$

→ Update the parameters with gradient descent.

[Diagram: the 3-2-1 network again, annotated with $\delta^3_1$ and $\delta^2_1$]
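Putting slides 34-35 together, one gradient step for the 3-2-1 network might look as follows (a minimal sketch assuming sigmoid units and squared error; not the original slides' code):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_step(x, y, W1, b1, W2, b2, lr=0.1):
        # forward pass, keeping activations for the deltas
        a2 = sigmoid(W1 @ x + b1)
        a3 = sigmoid(W2 @ a2 + b2)
        sp2, sp3 = a2 * (1 - a2), a3 * (1 - a3)    # s'(z) for the sigmoid
        # top-layer error: delta^3 = (a^3 - y) s'(z^3)
        d3 = (a3 - y) * sp3
        # other layers: delta^2 = (W^2)^T delta^3 * s'(z^2)
        d2 = (W2.T @ d3) * sp2
        # dE/dW^l_{ij} = a^l_i delta^{l+1}_j, dE/db^l_j = delta^{l+1}_j
        W2 -= lr * np.outer(d3, a2); b2 -= lr * d3
        W1 -= lr * np.outer(d2, x);  b1 -= lr * d2
        return 0.5 * np.sum((a3 - y) ** 2)          # the error E

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(2, 3)), np.zeros(2)
    W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)
    for _ in range(100):
        E = backprop_step(np.array([1.0, 0.5, -0.5]), np.array([1.0]), W1, b1, W2, b2)
    print(E)   # the error shrinks as training proceeds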
36. Embodiment
● Idea: concepts can only be learnt for and by a body
– → being affected by the environment
– actions and perception are learnt jointly.
– good perception is what allows successful actions.
● Example of reaching with a neural network (L. Jamone, L. Natale, G. Metta, F. Nori and G. Sandini, “Autonomous Online Learning of Reaching Behavior in a Humanoid Robot,” International Journal of Humanoid Robotics 9(3), 2012.)
37. Do we need embodiment?
● If you buy the emergent thesis, it is required
– joint development of perception & action
– symbol grounding in experience
– → emergent epistemology
● What type of embodiment?
– strong: physical body (or even organic body!)
– weak: a system coupled with its environment
● it can affect its environment, and
● it is affected by it
38. Phylogeny vs. Ontogeny
● Phylogeny: the system's design (e.g. features like SIFT or lines). High with the cognitivist approach, more limited in the emergent paradigm.
● Ontogeny: the system's development during its lifetime, drawn from experiences with its environment.
● Challenges for artificial systems:
– hard to learn high-level, abstract symbols autonomously.
– hard to generalise across experiences
– → how to learn abstract representations from experience?
39. Representation Learning
● Simple example: PCA (sketched in code below)
● Aim: identify dimensions that vary jointly
● Components are the axes of largest variation.
● Linear transformation
● Orthogonal basis
● Applied to natural images, generates filters similar to early cortical cells (V1)

$y = W^T x + \mu$

P.J.B. Hancock, R.J. Baddeley and L.S. Smith (1992), “The principal components of natural images,” Network: Computation in Neural Systems 3(1).
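A minimal NumPy sketch of PCA via the SVD (illustrative, not the slides' code; here the data are centred first, so the projection reads $y = W^T(x - \mu)$):

    import numpy as np

    def pca(X, k):
        # rows of X are data points
        mu = X.mean(axis=0)
        # right singular vectors of the centred data are the
        # axes of largest variation (an orthogonal basis)
        _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
        return mu, Vt[:k].T      # one component per column of W

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    mu, W = pca(X, 2)
    Y = (X - mu) @ W             # linear transformation of the data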
40. Arguments for deep hierarchies
● Feature sharing at intermediate levels → sub-linear coding and computation requirements (Fidler, Boben & Leonardis, “Evaluating multi-class learning strategies in a generative hierarchical framework for object detection,” NIPS 2009.)
● Compact coding (Bengio, Courville & Vincent, “Representation Learning: A Review and New Perspectives,” IEEE PAMI 35(8), 2013.)
● → The human visual system is estimated to have 5-10 levels (Krueger et al., “Deep Hierarchies in the Primate Visual Cortex: What Can We Learn for Computer Vision?”, 2013)
● → NN, CART, SVM: essentially 2-layer architectures
Figure from Fidler, Boben & Leonardis 2009
41. Arguments for Deep Hierarchies
● Problem with linear representations:
– A combination of any number of linear representations is also a linear representation... (checked numerically below)

$y = W_1^T x + \mu_1$
$z = W_2^T y + \mu_2$
$\Leftrightarrow\; z = W_2^T W_1^T x + W_2^T \mu_1 + \mu_2$
$\Leftrightarrow\; z = W_3^T x + \mu_3$

[Diagram: x → y via (W1, μ1), y → z via (W2, μ2); equivalently x → z via (W3, μ3)]
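A quick numerical check of this collapse (an illustrative sketch, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    W1, mu1 = rng.normal(size=(5, 4)), rng.normal(size=4)
    W2, mu2 = rng.normal(size=(4, 3)), rng.normal(size=3)
    x = rng.normal(size=5)

    # two stacked linear maps...
    z_stacked = W2.T @ (W1.T @ x + mu1) + mu2
    # ...equal a single linear map with W3 = W1 W2, mu3 = W2^T mu1 + mu2
    z_single = (W1 @ W2).T @ x + (W2.T @ mu1 + mu2)
    print(np.allclose(z_stacked, z_single))   # True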
42. Data-driven hierarchies: Autoencoders
● Idea: learn jointly a pair of mappings $y = \phi(x)$ and $z = \psi(y)$
● that minimises information loss:

$\underset{\phi,\psi}{\operatorname{argmin}} \sum_x \|x - \psi(\phi(x))\|_D$

● often using a neural network formulation (sketched below):

$\phi(x) = s(W x + b)$
$\psi(y) = W' y + b'$

[Diagram: inputs x1..x4 encoded by φ into hidden units y1..y3, decoded back by ψ]
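The encoder/decoder pair and the reconstruction loss, as a NumPy sketch (illustrative names; the training itself, gradient descent on this loss, is omitted here):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def phi(x, W, b):
        # encoder: y = s(W x + b)
        return sigmoid(W @ x + b)

    def psi(y, Wp, bp):
        # decoder: z = W' y + b'
        return Wp @ y + bp

    def reconstruction_loss(X, W, b, Wp, bp):
        # sum_x || x - psi(phi(x)) ||^2
        return sum(np.sum((x - psi(phi(x, W, b), Wp, bp)) ** 2) for x in X)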
43. Remember: ANNs
● An “artificial neuron” is in effect a linear transformation followed by a non-linear squashing function s (recap of slide 31): $a_n = f_{w,b}(x) = s\left(\sum_i w_i x_i + b\right)$
45. Data-driven hierarchies: Sparse Autoencoders
● Trivial solution when dim(Y) ≥ dim(X)!
● But overcomplete bases can be beneficial (Olshausen & Field 1996)
● Solution → sparse coding (penalty sketched below):

$\underset{\phi,\psi}{\operatorname{argmin}} \sum_x \|x - \psi(\phi(x))\|_D + g(\phi(x))$

[Diagram: x encoded to y by φ, decoded by ψ]
Olshausen, B. and Field, D. (1996). “Emergence of simple-cell receptive field properties by learning a sparse code for natural images.” Nature, 381(6583):607–609.
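Reusing the phi/psi sketch from slide 42, the sparse variant just adds a penalty g on the code; an L1 term is one common choice (an assumption for illustration, since the slide leaves g unspecified):

    def sparse_loss(X, W, b, Wp, bp, lam=0.1):
        # reconstruction error plus an L1 penalty on the codes,
        # which pushes most hidden activations towards zero
        loss = 0.0
        for x in X:
            y = phi(x, W, b)
            loss += np.sum((x - psi(y, Wp, bp)) ** 2) + lam * np.sum(np.abs(y))
        return loss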
46. Stacked Autoencoders
● You can stack multiple layers of AEs (see the sketch below)
● Trained layer-wise
● Note that the structure is the same as an ANN.
● Can be fine-tuned with backpropagation
[Diagram: x → h via (φ1, ψ1), then h → y via (φ2, ψ2)]
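Greedy layer-wise training, sketched in NumPy (a toy trainer, only to show the stacking pattern; all names and hyperparameters are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_autoencoder(X, hidden_dim, lr=0.1, epochs=50):
        # minimal AE trained by gradient descent on the squared
        # reconstruction error; returns the learnt encoder phi
        rng = np.random.default_rng(0)
        d = X.shape[1]
        W  = rng.normal(scale=0.1, size=(hidden_dim, d)); b  = np.zeros(hidden_dim)
        Wp = rng.normal(scale=0.1, size=(d, hidden_dim)); bp = np.zeros(d)
        for _ in range(epochs):
            for x in X:
                y = sigmoid(W @ x + b)
                e = (Wp @ y + bp) - x             # reconstruction error
                dy = (Wp.T @ e) * y * (1 - y)     # backprop through the code
                Wp -= lr * np.outer(e, y); bp -= lr * e
                W  -= lr * np.outer(dy, x); b  -= lr * dy
        return lambda x: sigmoid(W @ x + b)

    def train_stacked(X, layer_dims):
        # each AE learns to reconstruct the codes of the layer below
        encoders = []
        for dim in layer_dims:
            enc = train_autoencoder(X, dim)
            encoders.append(enc)
            X = np.array([enc(x) for x in X])     # feed codes upwards
        return encoders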
47. Limitations of ANN
● Problem with ANNs: they don't work well with more than 2 layers
● Problem with backprop: the gradient signal probably gets too diluted across layers (vanishing gradients).
● Problem for emergent cognition: we want to learn higher levels of abstraction!
● More recently, several alternatives have been developed (deep learning): Restricted Boltzmann Machines (RBM), stacked autoencoders, convolutional nets.
48. Today's hot topic:
Convolutional Neural Nets
● CNNs are neural nets (of course)
● sparse connectivity
● shared weights → convolutional (sketched below)
● receptive fields span all input dimensions
● typically alternating layers of convolution and max-pooling
[Diagram: sparse, weight-shared connections from input units to hidden units]
Fig. from http://deeplearning.net/tutorial/lenet.html
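The two building blocks, as a small NumPy sketch (strictly speaking this is cross-correlation, as in most CNN code; illustrative only):

    import numpy as np

    def conv2d(img, kernel):
        # 'valid' convolution: the same shared weights slide over
        # every location -- sparse connectivity + weight sharing
        kh, kw = kernel.shape
        H, W = img.shape
        out = np.empty((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
        return out

    def max_pool(fmap, size=2):
        # non-overlapping max-pooling: keep the strongest response
        # in each size x size block
        H2, W2 = fmap.shape[0] // size, fmap.shape[1] // size
        return fmap[:H2*size, :W2*size].reshape(H2, size, W2, size).max(axis=(1, 3))

    img = np.random.default_rng(0).normal(size=(8, 8))
    edge = np.array([[1.0, -1.0], [1.0, -1.0]])    # toy edge filter
    print(max_pool(conv2d(img, edge)).shape)       # (3, 3)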
49. CNNs (cont'd)
● Ex: LeNet (LeCun et al., 1998)
● Alternating convolution & subsampling (i.e., max-pooling) layers
● The top layer is a typical ANN.
● Trained using backprop & stochastic gradient descent
Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, 1998.
Figure from http://deeplearning.net/tutorial/lenet.html
50. CNNs (cont'd)
● Problem: training deep networks is difficult with backprop (slow, requires LOTS of data)
● CNNs do better (because of their sparse connectivity) but it is still a problem.
● → Unsupervised pre-training of the network
– using, e.g., sparse autoencoders, layer-wise.
– refine the weights with supervised backprop afterwards.
● Top results on MNIST, ILSVRC, PASCAL VOC.
51. Other Part-based Hierarchies
● Deep belief networks (DBN): Restricted Boltzmann Machines (Hinton, Osindero and Teh, “A Fast Learning Algorithm for Deep Belief Nets,” Neural Computation 18, 2006.)
● Slow Feature Analysis (SFA) (Franzius, Wilbert and Wiskott, “Invariant object recognition and pose estimation with slow feature analysis,” Neural Computation, 2011)
● Compositional hierarchies (Fidler, Boben & Leonardis, “Evaluating multi-class learning strategies in a generative hierarchical framework for object detection,” NIPS 2009)
● → Good review by Yoshua Bengio: Y. Bengio, A. Courville and P. Vincent, “Representation Learning: A Review and New Perspectives,” IEEE PAMI 35(8), 2013.
Fidler, S., M. Boben and A. Leonardis (2009), “Learning hierarchical compositional representations of object structure,” pp. 196-215 in Object Categorization: Computer and Human Vision Perspectives, edited by Sven J. Dickinson, Aleš Leonardis, Bernt Schiele and Michael J. Tarr. New York: Cambridge University Press.
52. Summary and conclusions
● There is no clear delineation between cognition and vision.
● Reasoning on hand-crafted symbols may be inadequate (semantic gap) or brittle.
● Learning abstraction is hard, but possible using deep hierarchies.
● Unsupervised pre-training for deep hierarchies is critical → this tells us something about cognition.