1. The document describes a neuromorphic approach to computer vision that aims to build vision systems based on the response properties of neurons in the ventral stream of the visual cortex.
2. It involves building a large-scale model of visual perception with ~10^8 units that spans several areas of the visual cortex and combines forward and reverse engineering.
3. The model has been shown to be consistent with experimental data across areas of the visual cortex and to explain human performance in rapid categorization tasks.
A neuromorphic approach to computer vision
1. A Neuromorphic Approach
to Computer Vision
Thomas Serre & Tomaso Poggio
Center for Biological and Computational Learning
Computer Science and Artificial Intelligence Laboratory
McGovern Institute for Brain Research
Department of Brain & Cognitive Sciences
Massachusetts Institute of Technology
2. Past Neo2 team: CalTech, Bremen & MIT
Tomaso Poggio, MIT
Bob Desimone, MIT
Christof Koch, CalTech
Winrich Freiwald, Bremen
Expertise:
Computational neuroscience
Animal behavior
Neuronal recording in IT and V4 + fMRI in monkeys
Data processing
Access to human recordings
Multi-electrodes
6. The problem: invariant recognition in natural scenes
Object recognition is hard! Our visual capabilities are computationally amazing.
Long-term goal: reverse-engineer the visual system and build machines that see and interpret the visual world as well as we do.
7. Neurally plausible quantitative model of visual perception
[Figure: model layers mapped onto the visual cortex. The ventral 'what' pathway (V1; V2, V3, V4; PIT, AIT, TE; TG, areas 36/35; rostral STS areas STP, TPO, PGa, IPa, TEa, TEm) and the dorsal 'where' pathway (MT, MST; PO, V3A; LIP, VIP, DP, 7a; PG cortex) are shown with main and bypass routes. Simple (S) cells implement tuning; complex (C) cells implement a MAX operation. Complexity (number of subunits), RF size and invariance increase along the hierarchy. Learning is unsupervised and task-independent up to AIT; the final animal vs. non-animal classification units (prefrontal cortex, areas 11, 13, 12, 45, 8, 46; ~10^0 units) use supervised, task-dependent learning.]

Model layer | RF size     | Num. units
S1          | 0.2°–1.1°   | 10^6
C1          | 0.4°–1.6°   | 10^4
S2          | 0.6°–2.4°   | 10^7
C2          | 1.1°–3.0°   | 10^5
S2b         | 0.9°–4.4°   | 10^7
S3          | 1.2°–3.2°   | 10^4
C2b         | 7°          | 10^3
C3          | 7°          | 10^3
S4          | 7°          | 10^2
8. Neurally plausible quantitative model of visual perception
Large-scale (10^8 units), spans several areas of the visual cortex.
[Same model figure as slide 7.]
9. Neurally plausible quantitative model of visual perception
Large-scale (10^8 units), spans several areas of the visual cortex.
Combination of forward and reverse engineering.
[Same model figure as slide 7.]
10. Neurally plausible quantitative model of visual perception
Large-scale (10^8 units), spans several areas of the visual cortex.
Combination of forward and reverse engineering.
Shown to be consistent with many experimental data across areas of visual cortex.
[Same model figure as slide 7.]
17. Model validation against electrophysiology data
[Figure: classification performance (0–1) for IT neurons vs. model across test conditions. TRAIN: size 3.4°, center. Tests: sizes 3.4°, 1.7°, 6.8° at center; size 3.4° at 2° and 4° horizontal offsets.]
Model data: Serre Kouh Cadieu Knoblich Kreiman & Poggio 2005
Experimental data: Hung* Kreiman* Poggio & DiCarlo 2005
20. Explaining human performance in rapid categorization tasks
[Figure: example stimuli from four animal subcategories (Head, Close-body, Medium-body, Far-body) and matched natural distractors.]
Serre Oliva & Poggio 2007
21. Explaining human performance in rapid categorization tasks
[Figure: performance (d', axis roughly 1.0–2.6) for model (82% correct) vs. human observers (80% correct) across the four subcategories: Head, Close-body, Medium-body, Far-body.]
Serre Oliva & Poggio 2007
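The d' (sensitivity) values plotted on these slides come from standard signal detection theory; as a minimal illustration (not the authors' code), d' is the difference of the inverse-normal-CDF transforms of the hit and false-alarm rates:

```python
from statistics import NormalDist

def d_prime(hit_rate: float, fa_rate: float) -> float:
    """Sensitivity index: d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# ~80% correct on a balanced animal vs. non-animal task
# (80% hits, 20% false alarms) corresponds to d' of about 1.7:
print(round(d_prime(0.80, 0.20), 2))  # -> 1.68
```

The same measure appears on the attention slides later in the deck.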
22. Decoding animal category from IT cortex
[Figure: recording site in monkey IT.]
Meyers Freiwald Embark Kreiman Serre Poggio in prep
23. Decoding animal category from IT cortex
[Figure: decoding performance for the model, IT neurons, and fMRI; recording site in monkey IT.]
Meyers Freiwald Embark Kreiman Serre Poggio in prep
29. Bio-motivated computer vision
Scene parsing and object recognition: a computer vision system based on the response properties of neurons in the ventral stream of the visual cortex.
Serre Wolf & Poggio 2005; Wolf & Bileschi 2006; Serre et al 2007
30. Bio-motivated computer vision
Scene parsing and object recognition.
Serre Wolf & Poggio 2005; Wolf & Bileschi 2006; Serre et al 2007
31. Bio-motivated computer vision
Scene parsing and object recognition. [Figure: computational cost, in Gflops.]
Serre Wolf & Poggio 2005; Wolf & Bileschi 2006; Serre et al 2007
32. Bio-motivated computer vision
Scene parsing and object recognition: speed improvement since 2006

image size | multi-thread | GPU (CUDA)
64x64      | 4.5x         | 14x
128x128    | 3.5x         | 14x
256x256    | 1.5x         | 17x
512x512    | 2.5x         | 25x

From ~1 min down to ~1 sec!
Serre Wolf & Poggio 2005; Wolf & Bileschi 2006; Serre et al 2007
36. Automatic recognition of rodent behavior
Performance (agreement with human annotation):
human agreement: 72%
proposed system: 71%
commercial system: 56%
chance: 12%
Serre Jhuang Garrote Poggio Steele in prep
41. Neuroscience of attention and Bayesian inference
[Diagram: integrated model of attention and recognition spanning V2, V4/PIT, IT and PFC.]
In collaboration with the Desimone lab (monkey electrophysiology).
42. Neuroscience of attention and Bayesian inference
[Diagram: as slide 41, adding feature-based attention.]
In collaboration with the Desimone lab (monkey electrophysiology).
43. Neuroscience of attention and Bayesian inference
[Diagram: as slide 41, adding feature-based attention and spatial attention via LIP/FEF.]
In collaboration with the Desimone lab (monkey electrophysiology).
44. Neuroscience of Attention and Bayesian inference
[Diagram: V2, V4/PIT, IT and PFC with feature-based attention, plus spatial attention via LIP/FEF.]
see also Rao 2005; Lee & Mumford 2003. Chikkerur Serre & Poggio in prep
45. Neuroscience of Attention and Bayesian inference
[Diagram: Bayesian formulation of the model. PFC carries object priors (O, feature-based attention); LIP/FEF carries location priors (L, spatial attention); intermediate variables Fi and Fli link features and locations to the image I via V2, V4/PIT and IT.]
see also Rao 2005; Lee & Mumford 2003. Chikkerur Serre & Poggio in prep
46. Model predicts human eye movements well
Integrating (local) feature-based + (global) context-based cues accounts for 92% of inter-subject agreement!
Chikkerur Tan Serre & Poggio in sub
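In the Bayesian formulation sketched on the preceding slides, integrating local feature cues with a global location prior amounts to multiplying the two and normalizing. A minimal sketch of that idea, with purely hypothetical numbers (not the actual model):

```python
import numpy as np

# Hypothetical evidence at 5 candidate image locations.
feature_evidence = np.array([0.2, 0.1, 0.9, 0.3, 0.2])  # local, bottom-up feature cues
location_prior   = np.array([0.1, 0.2, 0.2, 0.4, 0.1])  # global, context-based prior

# Bayesian cue integration: combine multiplicatively, then normalize
# to get a posterior over where to attend / fixate next.
posterior = feature_evidence * location_prior
posterior /= posterior.sum()

print(int(posterior.argmax()))  # -> 2: strong local evidence wins despite a modest prior
```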
47. Model performance improves with attention
[Figure: performance (d') for model vs. humans, comparing no attention with one shift of attention.]
Chikkerur Serre & Poggio in prep
48. Model performance improves with attention
[Figure: performance (d', scale 0–3) for model vs. humans, no attention vs. one shift of attention. Slides 49–50 repeat this figure.]
Chikkerur Serre & Poggio in prep
51. Model performance improves with attention
[Figure: as slide 48, with mask and no-mask conditions.]
Chikkerur Serre & Poggio in prep
53. Main Achievements in Neo2
Extended + extensively tested feedforward model on real-world recognition tasks [Poggio]:
matches neural data
mimics human performance in rapid categorization
performs at the level of state-of-the-art computer vision systems
C++ software + interface available / 100x speed-up
combined with saliency algorithm + tested on real-time street surveillance (video)
54. Main Achievements in Neo2 (continued)
Demonstrated read-out of cluttered natural images from monkey fMRI and physiology recordings in inferotemporal cortex [Freiwald and Poggio]:
first decoding of cluttered complex images
agreement with original feedforward model
55. Main Achievements in Neo2 (continued)
Characterized neural encoding in V4, IT and FEF under passive and task-dependent viewing conditions [Desimone and Poggio]:
characterized the dynamics of bottom-up vs. top-down visual information processing (characteristic timing signature of activity in V4 and IT vs. FEF)
top-down, task-dependent attention modulates features in V4 and IT
57. Main Achievements in Neo2
Implemented new extended model, suggested by these neuroscience data from the Desimone lab, to include attention via feedback loops from higher areas [Poggio]:
predicts human gaze in natural images well
significantly improves recognition performance of the original model in clutter
58. Main Achievements in Neo2 (continued)
Extended model for classification of video sequences (i.e., action recognition) [Poggio]:
tested on several video databases and shown to outperform previous algorithms
59. Main Achievements in Neo2 (continued)
Demonstrated read-out from human medial temporal lobe (MTL) [Koch]:
decoding of natural scenes from single neurons in human MTL
improved ability of saliency model to mimic human gaze patterns
60. Main Achievements in Neo2 (continued)
Model used to transfer neuroscience data to biologically inspired vision systems.
61. Future Directions
MIT team: Poggio, Desimone, Serre, 1-of-2 IT physiologist, + (Koch+Itti)
Develop new technologies to decode computations and representations in the visual cortex:
62. Future Directions (continued)
Optical silencing and circuit-stimulation technology based on X-rhodopsin.
63. Future Directions (continued)
Multi-electrode network technology.
64. Future Directions (continued)
Simultaneous recording system across areas.
65. From the neuroscience data towards a system-level model of natural vision
MIT team: Poggio, Desimone, Serre, XXX
1. Clutter and image ambiguities: attention and cortical feedback
2. Learning and recognition of objects in video sequences
67. Clutter and image ambiguities: attention and cortical feedback
Circuitry of attention and the role of synchronization in top-down and bottom-up search tasks: monkey electrophysiology in V4, IT and FEF.
72. Past Neo2 team: CalTech, Bremen & MIT
Tomaso Poggio, MIT
Bob Desimone, MIT
Christof Koch, CalTech
Winrich Freiwald, Bremen
73. IT readout improves with attention
[Figure: decoding timeline with stim, cue and transient-change epochs, for an isolated object vs. object not shown.]
Zhang Meyers Serre Bichot Desimone Poggio in prep, n=67
74. IT readout improves with attention
[Figure: as slide 73, adding the attention-away-from-object condition.]
Zhang Meyers Serre Bichot Desimone Poggio in prep, n=67
75. IT readout improves with attention
[Figure: as slide 74.]
Zhang Meyers Serre Bichot Desimone Poggio in prep, n=67
76. IT readout improves with attention (MIT team: Poggio, Desimone, Serre, XXX)
[Figure: as slide 73, with both attention-on-object and attention-away-from-object conditions, plus the object-not-shown baseline.]
Zhang Meyers Serre Bichot Desimone Poggio in prep, n=67
77. Two functional classes of cells to explain invariant object recognition in the visual cortex
Simple cells: template matching; Gaussian-like tuning (~"AND")
Complex cells: invariance; max-like operation (~"OR")
Riesenhuber & Poggio 1999 (building on Fukushima 1980 and Hubel & Wiesel 1962)
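The two operations named on this slide can be sketched in a few lines. This is a schematic of the Riesenhuber & Poggio (1999) idea only; the template, patches and sigma below are illustrative choices, not model parameters:

```python
import numpy as np

def simple_cell(x: np.ndarray, w: np.ndarray, sigma: float = 1.0) -> float:
    """Template matching: Gaussian-like tuning around a stored template w
    (an 'AND'-like operation over the subunits)."""
    return float(np.exp(-np.sum((x - w) ** 2) / (2 * sigma ** 2)))

def complex_cell(responses: np.ndarray) -> float:
    """Invariance: max-like pooling over simple cells tuned to the same
    feature at different positions/scales (an 'OR'-like operation)."""
    return float(np.max(responses))

# The complex cell inherits the best match among its afferents, so its
# response tolerates changes in where the feature appears.
template = np.array([1.0, 0.0, 1.0])
patches = [np.array([1.0, 0.0, 1.0]),   # feature present at one position
           np.array([0.0, 1.0, 0.0])]   # poor match at another position
s = np.array([simple_cell(p, template) for p in patches])
print(complex_cell(s))  # -> 1.0: driven by the best-matching position
```

Stacking alternating S and C stages of this kind is what produces the gradual increase in complexity and invariance shown in the model hierarchy on slide 7.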
Editor's Notes
Here is the team that I am representing: Tomaso Poggio and Bob Desimone at MIT, Christof Koch at CalTech, and Winrich Freiwald, who used to be in Bremen, now at CalTech and soon at Rockefeller.
Our group has been focusing on the computational mechanisms of invariant object recognition. This is obviously a very hard computational problem, and despite decades of engineering effort we still have not been able to build a computer algorithm that can compete with the speed, robustness and efficiency of the primate visual system.
Our long-term goal is thus to build machines that not only mimic the processing of information in the visual cortex but also see and interpret the visual world as well as we do.
Over the years we have developed an initial quantitative model of information processing in the visual cortex. The model tries to summarize what is currently known about the anatomy, physiology and organization of the visual cortex. It does not try to explain the processing of information in one specific visual area; instead it spans several visual areas with a relatively large number of units (on the order of 100 million).
The model combines reverse engineering, where parameters such as RF sizes are derived from available data, with forward engineering, as it is inspired by well-known principles from learning theory and computer vision.
Together with colleagues, we have shown that the resulting architecture is surprisingly consistent with data from V1, V2, V4, MT and IT.
Unfortunately I am not going to have much time to give you details about this model; I would be happy to talk afterwards if anyone has questions. The key assumption is that when the visual system is flashed an image, the visual signal is rapidly routed through a hierarchy of visual areas in a single feedforward sweep.
Our key assumption is that the goal of the ventral stream of the visual cortex is to build, during the first 150 ms of visual processing, a base representation whereby object categories can be represented in a position- and scale-tolerant manner, before more complex routines, in particular shifts of attention and eye movements, take place.
This base representation takes the form of a population of model units in various stages of the hierarchy tuned to key features of natural images with different levels of complexity and invariance. Learning in the model of the ventral stream is unsupervised, so that when training the model to recognize a new object category we do not have to retrain the whole hierarchy, only the task-specific circuits that sit at the top, for instance in the PFC; you can think of these task-specific circuits as a linear classifier.
Let me show you one example of the validation we have performed on this model. Here we considered a small population of about 200 random model units in one of the top stages of the architecture I just presented. From this population activity we can try to read out the object category of stimuli presented to the model. In fact, we can train a classifier with stimuli presented at one position and scale and see how well it generalizes to other positions and scales; this tells you how much invariance is built into the population of units. We get the results indicated by the light gray bars, corresponding to different amounts of shift in position and scale. You can play the same game on neurons in IT, which is the highest purely visual area and has been critically linked with primates' ability to recognize objects invariant of their position and scale. Here we found that the model was able to predict not only the overall level of performance but also the range of invariance to position and scale.
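The readout procedure described here (train a linear classifier at one position/scale, test at another) can be sketched on synthetic data. Everything below is a toy stand-in for the population code, not the recorded model or IT data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy population code: n_units "IT-like" responses per trial. Category
# drives a fixed direction ('signal'); presenting the stimulus at a new
# position is modeled as a small additive shift common to both classes.
n_units, n_trials = 200, 200
signal = rng.normal(size=n_units)
shift = 0.3 * rng.normal(size=n_units)

def trials(label, offset):
    return label * signal + offset + rng.normal(size=(n_trials, n_units))

# Train a linear readout on trials from the "center" position...
X_train = np.vstack([trials(+1, 0), trials(-1, 0)])
y = np.array([+1] * n_trials + [-1] * n_trials)
w, *_ = np.linalg.lstsq(X_train, y, rcond=None)

# ...then test on trials from a shifted position never seen in training.
X_test = np.vstack([trials(+1, shift), trials(-1, shift)])
acc = np.mean(np.sign(X_test @ w) == y)
print(f"generalization accuracy: {acc:.2f}")
```

High accuracy on the shifted test set reflects how much invariance the (toy) population carries, which is exactly what the light gray bars quantify for the real model.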
Another important validation is behavior, assessed here using human psychophysics.
As I mentioned earlier, the original goal of the model was not to explain natural everyday vision, when you are free to move your eyes and shift your attention, but rather what is often called rapid or immediate recognition, which corresponds to the first 100-150 ms of visual processing (when an image is briefly presented), i.e., when the visual system is forced to operate in a feedforward mode before eye movements and shifts of attention take place.
An example is shown on the left. Here I flash an image for a few ms; you probably do not have time to take in every fine detail of the image, but most people are able to say whether it contains an animal or not.
Here we divided our dataset into four subcategories: head, close-body, medium-body and far-body. Overall both the model and humans score about 80% on this very difficult task, and you can see that they agree quite well in terms of how they perform across these four subcategories.
This dependency of human and model performance on clutter motivated a subsequent electrophysiology experiment done with Winrich Freiwald during the Neo2 project.
Here we found that this trend still holds for neurons in monkey IT cortex. We used fMRI to find areas that are differentially selective for animal vs. non-animal images, and Winrich then recorded from a small population of about 200 neurons in this area. You can see the readout results on the right: we could reliably read out the animal category from these difficult real-world images. Interestingly, we also found a surprisingly strong signal at the level of the BOLD response (using a contrast agent).
More recently we gained access to a population of patients with intractable epilepsy who are scheduled for resective surgery. Typically the patients spend about a week at the hospital with implanted electrodes, monitored 24/7 to essentially triangulate the epileptic site. These patients are a unique opportunity to obtain not only behavioral measurements but also simultaneous intracranial recordings (here we measure local field potentials from iEEG). I should emphasize that the spatial and temporal resolution we get is several orders of magnitude higher than what we could get with a non-invasive imaging technique such as fMRI.
As an illustration, here is one electrode from one patient performing this animal vs. non-animal categorization task. The electrode location still has to be confirmed but is probably somewhere around the temporal lobe. You can see that already around 145 ms one can read out the presence or absence of an animal presented to the patient.
Of course, one key limitation of this approach is that we have no control over the location of the electrodes, which is based solely on medical criteria. However, by pooling together data from multiple patients we hope to be able to reconstruct the feedforward sweep and recover readout latencies across the temporal lobe.
In parallel we have used this model in real-world computer vision applications. For instance, we have developed a computer vision system for the automatic parsing of street scene images. Here are examples of automatic parsing by the system overlaid on the original images; the colors and bounding boxes indicate predictions from the model (e.g., green for trees).
We have also made a number of improvements to the implementation of this model. The original MATLAB implementation was quite slow...
We have been working on several ways to speed up this model: we started with an efficient multi-threaded C/C++ implementation and then moved on to exploiting the recent gains in computational power from graphics processing hardware (GPUs).
More recently we have extended the approach to the recognition of human actions such as running, walking, jogging, jumping, and waving.
In all cases we have shown that the resulting biologically motivated computer vision systems perform on par with or better than state-of-the-art computer vision systems.
There are several other systems that
Let me switch gears and tell you a little bit about our work on attention. As I showed you earlier, one key limitation of this feedforward architecture is that it performs well only when the object to be recognized is large and the amount of background clutter is limited. I have shown you that, consistent with human psychophysics and monkey electrophysiology, the performance of the model decreases quite significantly as the amount of clutter increases.
Here we have been working under the assumption that the visual system overcomes this limitation via cortical feedback and shifts of attention. In particular, our working hypothesis is that the role of spatial attention is to suppress the clutter so that the object of interest appears as if it were presented in isolation.
In collaboration with electrophysiology labs we are studying the circuits and networks of visual areas involved in attention, which involve a complex interaction between the ventral stream (area V4 in particular), prefrontal areas such as the FEF, and the parietal cortex.
We made two key extensions to this model.
First, we assume that feature-based attention acts through a cascade of top-down connections through the ventral stream, originating in the PFC, where a template of the target object is held in memory, all the way down to V4 and possibly lower areas.
Second, we assume a spatial attention modulation originating in the parietal cortex (here I am assuming LIP, based on limited experimental evidence).
These attentional mechanisms can be cast in a probabilistic Bayesian framework, whereby the parietal cortex represents location variables and the ventral stream represents feature variables, i.e., our image fragments.
Variables for the target object are encoded in higher areas such as the PFC...
This framework is inspired by an earlier model by Rao to explain spatial attention, and it is a special case of the computational model of the visual cortex described by David Mumford, which most of you probably know...
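A toy numerical version of this idea (an illustrative sketch with made-up numbers and a made-up Gaussian likelihood, not the published model): location variables get a spatial prior, each location carries a feature response, and the PFC template enters as a likelihood; the posterior over locations then acts as a target-biased saliency map, and a shift of attention is simply its argmax.

```python
import numpy as np

# Discrete location variable L (parietal cortex), one feature response per
# location (ventral stream), and a target template (PFC). All values here
# are invented for illustration.
n_loc = 4
prior_L = np.full(n_loc, 1.0 / n_loc)           # uniform spatial prior

# Hypothetical feature responses at each of the four locations
features = np.array([0.2, 0.9, 0.3, 0.1])

# PFC target template: likelihood of a feature response given that the
# target occupies that location (Gaussian around the template value)
template = 0.85
def likelihood(f, target=template, sigma=0.2):
    return np.exp(-0.5 * ((f - target) / sigma) ** 2)

post_L = prior_L * likelihood(features)
post_L /= post_L.sum()                          # posterior over locations

attended = int(np.argmax(post_L))               # one "shift of attention"
print("posterior over locations:", np.round(post_L, 3))
print("attend location:", attended)
```

The posterior concentrates on the location whose features best match the target template, which is exactly the behavior we want from feature-based attention before a spatial shift.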
We have implemented the approach in the context of our animal detection task, and the performance of the model increases with only one shift of attention. Here is the performance of the feedforward model, as I showed you earlier, but now averaged across all categories. Here is the performance allowing one shift of attention. For comparison, here is the performance of human observers when images are flashed very briefly, and here is their performance when given just a little more time, presumably just enough to allow one shift of attention. Obviously, our long-term goal is to match the human level of performance when observers are given as much time as needed.
Let me just summarize some of our main achievements from phase 0 of Neo2.
If we want to make real progress in deciphering the computations and representations in the visual cortex, we need to study brains not just at the level of single neurons; we need to integrate multiple levels of analysis.
In particular, we need to be able to:
1) understand how key computations for object recognition are carried out in cortical microcircuits. We have been working on new tools for optical silencing and stimulation of neurons, based on channelrhodopsin, to study these circuits;
2) understand the interaction between networks of neurons within single cortical areas. This will require the development of multi-electrode technologies, not only in lower visual areas, as is currently done, but also in higher visual areas that are more difficult to access;
3) finally, record not just from one area at a time but from multiple areas, to understand how these areas communicate with each other.
At the same time, these neuroscience data will allow us not only to validate but also to extend existing models of the visual cortex and, hopefully, to improve their recognition capabilities. In particular, if we want computer systems that can compete with the primate visual system, we need to go beyond rapid categorization tasks and study vision in more natural settings.
In particular, I think there are two key neuroscience questions that need to be studied.
First, as I alluded to already in this talk, cortical feedback and shifts of attention are likely to be the key computational mechanisms by which the visual system solves most of the difficulties inherent to vision, namely dealing with significant amounts of clutter as well as ambiguity in the visual input due to occlusion or a low signal-to-noise ratio.
The second is the processing of image sequences, not as a succession of independent snapshots, as in the model of rapid object categorization I showed you, but with models that can exploit the temporal continuity of image sequences, both for learning invariance to transformations (zooming and looming, translation, 3D rotation, etc.) and for the recognition of objects in motion.
Along those lines we have started to make significant progress in understanding the circuitry of attention and in particular how spatial attention works to suppress the clutter in image displays of this kind.
The next step is obviously to move towards more natural stimulus presentations.
I think significant progress in computer vision will come from the use of video sequences and the exploitation of temporal continuity in those sequences.
Here is the way current computer vision systems treat the visual world: as a collection of independent frames. The visual world is much richer than that, and time is an important component of visual perception. Babies do not learn to recognize giraffes via labeled examples of this kind. Instead, a baby going to the zoo, perhaps for the first time, has access to much richer information, whereby giraffes undergo transformations such as rotation in depth, looming, or shifting on the retina in a smooth, continuous way. It is our belief that by exploiting these principles we will be able to build better learning algorithms.
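As a concrete illustration of how temporal continuity can drive invariance learning, here is a small sketch of a trace rule in the spirit of Foldiak (1991); all parameter values are made up and this is not our model: a unit's Hebbian update is gated by a low-pass-filtered activity trace, so successive views of a smoothly moving stimulus get bound to the same unit without any labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (invented parameters): one unit watching a 6-pixel "retina"
# across which a bar sweeps smoothly, one position per frame.
n_inputs, n_frames, eta, decay = 6, 240, 0.05, 0.8
w = np.full(n_inputs, 1.0) + 0.05 * rng.normal(size=n_inputs)
w /= np.linalg.norm(w)
trace = 0.0

def spread(w):
    """Difference between the strongest and weakest position response."""
    return float(w.max() - w.min())

spread_before = spread(w)
for t in range(n_frames):
    x = np.zeros(n_inputs)
    x[t % n_inputs] = 1.0              # bar sweeping smoothly across positions
    y = float(w @ x)
    trace = decay * trace + (1 - decay) * y   # low-pass activity trace
    w += eta * trace * x               # Hebbian update gated by the trace
    w /= np.linalg.norm(w)             # weight normalization keeps w bounded

# The unit now responds nearly equally to the bar at every position:
print(f"response spread before {spread_before:.3f}, after {spread(w):.3f}")
```

Because the trace carries activity over from the previous frames, every position of the sweeping bar reinforces the same unit, and the response spread across positions shrinks, which is a toy form of translation invariance learned purely from temporal continuity.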
Most of the work in the areas of computer vision and visual neuroscience has focused on the recognition of isolated objects. However, vision is much more than classification: it involves interpreting, parsing, and navigating visual scenes. Just by looking, a human observer can answer an essentially infinite number of questions about an image: for instance, about the location and boundary of an object, how to grasp it, or how to navigate over it. These are essential problems for robotics applications, yet they have remained largely unaddressed in the field of neuroscience.
Here is the team that I am representing: Tomaso Poggio and Bob Desimone at MIT, Christof Koch at CalTech, and Winrich Freiwald, who used to be in Bremen, is now at CalTech, and will soon be at Rockefeller.
We have implemented the approach in the context of our animal search task.
The model mostly improves on the medium and far conditions.
Computational considerations suggest that you need two types of operations, and therefore two functional classes of cells, for invariant object recognition.
The Gaussian-bell tuning was motivated by a learning algorithm based on radial basis functions, while the max operation was motivated by the standard scanning approach in computer vision and by theoretical arguments from signal processing.
The goal of the simple units is to increase the complexity of the representation; in this example, by pooling together the activity of afferent units with different orientations via Gaussian-like tuning. This Gaussian tuning is ubiquitous in the visual cortex, from orientation tuning in V1 to tuning for complex objects around certain poses in IT.
The complex units pool together afferent units with the same preferred stimulus (e.g., a vertical bar) but slightly different positions and scales. At the complex-unit level we thus build some tolerance to the exact position and scale of the stimulus within the receptive field of the unit.
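The two operations can be summarized in a few lines of code (an illustrative sketch with toy templates and parameter values, not the actual model parameters): an S unit applies Gaussian-bell tuning to its afferents, and a C unit takes the max over S units with the same preferred stimulus at different positions.

```python
import numpy as np

def s_unit(afferents, prototype, sigma=1.0):
    """Simple unit: Gaussian (RBF-like) tuning, peaking on prototype match."""
    d2 = np.sum((np.asarray(afferents) - np.asarray(prototype)) ** 2)
    return float(np.exp(-d2 / (2 * sigma ** 2)))

def c_unit(responses):
    """Complex unit: max over same-feature afferents at nearby positions."""
    return float(np.max(responses))

# A "vertical bar" S-unit template probed at three positions within the
# C unit's receptive field: the max keeps the best-matching position, so
# the C response is tolerant to where the preferred stimulus falls.
prototype = np.array([0.0, 1.0, 0.0])           # toy vertical-bar template
patches = [np.array([1.0, 0.0, 0.0]),           # bar at wrong position
           np.array([0.0, 1.0, 0.0]),           # bar at preferred position
           np.array([0.0, 0.0, 1.0])]           # bar at wrong position
s_responses = [s_unit(p, prototype) for p in patches]
print("C response:", c_unit(s_responses))       # 1.0: perfect match survives
```

The Gaussian tuning builds selectivity (complexity of the representation) and the max builds tolerance (invariance), which is exactly the alternation of S and C layers in the model.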