A Neuromorphic Approach
to Computer Vision
Thomas Serre & Tomaso Poggio


Center for Biological and Computational Learning
Computer Science and Artificial Intelligence Laboratory
McGovern Institute for Brain Research
Department of Brain & Cognitive Sciences
Massachusetts Institute of Technology
Past Neo2 team: CalTech, Bremen & MIT
- Tomaso Poggio, MIT
- Bob Desimone, MIT
- Christof Koch, CalTech
- Winrich Freiwald, Bremen

Expertise:
- Computational neuroscience
- Animal behavior
- Neuronal recording in IT and V4 + fMRI in monkeys
- Data processing
- Access to human recordings
- Multi-electrodes
The problem: invariant recognition in natural scenes
- Object recognition is hard!
- Our visual capabilities are computationally amazing
- Long-term goal: reverse-engineer the visual system and build machines that see and interpret the visual world as well as we do
Neurally plausible quantitative model of visual perception

[Figure: tentative mapping of the model onto the anatomy of the primate visual cortex. Left: the dorsal stream ('where' pathway: V1 → V2/V3 → MT, V3A, PO → LIP, VIP, DP, 7a, MST, FST, STP) and the ventral stream ('what' pathway: V1 → V2 → V4 → PIT → AIT → TE, TG, 36/35 → prefrontal cortex), with main and bypass routes. Right: the corresponding model layers, alternating simple cells (tuning) and complex cells (MAX). Complexity (number of subunits), RF size and invariance increase along the hierarchy; lower stages use unsupervised, task-independent learning, while the top classification units (e.g., animal vs. non-animal) use supervised, task-dependent learning.]

Model layers, receptive-field (RF) sizes and number of units:

    Layer                   RF size         Num. units
    classification units    -               10^0
    S4                      7°              10^2
    C3                      7°              10^3
    C2b                     7°              10^3
    S3                      1.2° - 3.2°     10^4
    S2b                     0.9° - 4.4°     10^7
    C2                      1.1° - 3.0°     10^5
    S2                      0.6° - 2.4°     10^7
    C1                      0.4° - 1.6°     10^4
    S1                      0.2° - 1.1°     10^6
- Large-scale (10^8 units); spans several areas of the visual cortex
- Combination of forward and reverse engineering
- Shown to be consistent with many experimental data across areas of the visual cortex
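The two operations that alternate through these layers can be sketched in a few lines. Below is a minimal numpy illustration, not the released C++ implementation: an S layer as Gaussian-like tuning to stored templates ("AND"-like), and a C layer as a local MAX over S units ("OR"-like); array shapes and parameters are assumptions for the example.

```python
import numpy as np

def s_layer(inputs, templates, sigma=1.0):
    """Simple ('S') units: Gaussian-like tuning of each unit to a stored
    template over its afferents -- an 'AND'-like template match.
    inputs: (n_positions, d); templates: (n_templates, d)."""
    d2 = ((inputs[:, None, :] - templates[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))      # (n_positions, n_templates)

def c_layer(s_responses, pool_size=2):
    """Complex ('C') units: MAX pooling over a local neighborhood of S units
    with the same selectivity -- an 'OR'-like invariance operation."""
    n, k = s_responses.shape
    n_out = n // pool_size
    trimmed = s_responses[: n_out * pool_size].reshape(n_out, pool_size, k)
    return trimmed.max(axis=1)                 # (n_out, n_templates)
```

Stacking these two operations with progressively larger templates and pooling ranges yields the increase in complexity, RF size and invariance listed in the table above.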
Feedforward processing and rapid recognition

[Figure: a feedforward sweep through the hierarchy; category-selective units at the top are read out by a linear perceptron.]
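The read-out stage named in the figure, a linear perceptron over the category-selective units, admits a compact sketch; the classic perceptron update below is illustrative and stands in for whatever training procedure was actually used.

```python
import numpy as np

def train_perceptron(features, labels, epochs=100, lr=0.1):
    """Linear perceptron over category-selective model units.
    labels in {-1, +1} (e.g., animal vs. non-animal)."""
    w = np.zeros(features.shape[1] + 1)                    # weights + bias
    X = np.hstack([features, np.ones((len(features), 1))])
    for _ in range(epochs):
        for x, y in zip(X, labels):
            if y * (w @ x) <= 0:                           # misclassified: update
                w += lr * y * x
    return w

def predict(w, features):
    X = np.hstack([features, np.ones((len(features), 1))])
    return np.sign(X @ w)
```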
Model validation against electrophysiology data

[Figure: classification performance of read-outs from IT neurons vs. model units. Classifiers are trained on objects at size 3.4°, centered, and tested at sizes 3.4°, 1.7° and 6.8° (center) and at size 3.4° shifted 2° and 4° horizontally.]

Model data: Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005
Experimental data: Hung*, Kreiman*, Poggio & DiCarlo 2005
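The logic of this test can be restated as: train a classifier on population responses to objects at one size and position (3.4°, center), then evaluate it, without retraining, on responses to scaled and shifted versions. A minimal sketch, with a hypothetical `responses` dictionary keyed by condition:

```python
import numpy as np

def invariance_readout(responses, labels, train_cond, test_conds,
                       train_fn, predict_fn):
    """Train on one stimulus condition, test generalization to others.
    responses: dict mapping condition name -> (n_trials, n_units) array.
    train_fn / predict_fn: any linear classifier, e.g. the perceptron above."""
    w = train_fn(responses[train_cond], labels)
    return {cond: (predict_fn(w, responses[cond]) == labels).mean()
            for cond in test_conds}
```

Invariance shows up as test accuracy staying well above chance under conditions the classifier never saw.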
Explaining human performance in rapid categorization tasks

[Figure: example animal and natural (distractor) stimuli at four views: head, close-body, medium-body, far-body.]

[Figure: performance (d') per view (head, close-body, medium-body, far-body); the model (82% correct) closely matches human observers (80% correct).]

Serre, Oliva & Poggio 2007
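Performance here is reported as d', the standard signal-detection sensitivity measure, computed from hit and false-alarm rates as d' = Z(hit) − Z(false alarm), where Z is the inverse standard normal CDF; the rates in the example below are made up for illustration.

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Signal-detection sensitivity: d' = Z(hit) - Z(false alarm)."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)

# Illustrative rates only: 82% hits, 18% false alarms -> d' of about 1.83
print(d_prime(0.82, 0.18))
```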
Decoding animal category from IT cortex

[Figure: recording site in monkey's IT; decoding performance compared for the model, IT neurons and fMRI.]

Meyers, Freiwald, Embark, Kreiman, Serre & Poggio, in prep.
Decoding animal category from IT cortex in humans

[Figure: animal vs. non-animal category can be read out at ~145 ms.]
Bio-motivated computer vision: scene parsing and object recognition

A computer vision system based on the response properties of neurons in the ventral stream of the visual cortex.

Serre, Wolf & Poggio 2005; Wolf & Bileschi 2006; Serre et al. 2007
[Figure: example scene-parsing results; computational cost on the order of Gflops.]
Speed improvement since 2006:

    image size    multi-thread    GPU (CUDA)
    64x64         4.5x            14x
    128x128       3.5x            14x
    256x256       1.5x            17x
    512x512       2.5x            25x

From ~1 min down to ~1 sec!
Bio-motivated computer vision: action recognition in video sequences

Motion-sensitive, MT-like units support recognition of actions (e.g., wave 1, wave 2, bend, jump, jump 2, jack, side, walk, run).

Jhuang, Serre, Wolf & Poggio 2007
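The motion-sensitive, MT-like front end can be caricatured as direction-selective units driven by local spatio-temporal gradients. The sketch below is a strong simplification of the model's actual spatio-temporal filters; the shapes and rectification rule are assumptions for illustration.

```python
import numpy as np

def mt_like_responses(frames, n_directions=4):
    """Direction-selective, MT-like motion units (illustrative sketch):
    each unit responds when the local spatio-temporal gradient is
    consistent with motion along its preferred direction.
    frames: array of shape (T, H, W), at least two frames."""
    dy, dx = np.gradient(frames[-1])      # spatial gradient of last frame
    dt = frames[-1] - frames[-2]          # temporal derivative
    responses = []
    for k in range(n_directions):
        theta = k * np.pi / n_directions
        # gradient component along the preferred direction
        proj = dx * np.cos(theta) + dy * np.sin(theta)
        responses.append(np.maximum(-dt * proj, 0.0))   # half-rectified
    return np.stack(responses)            # (n_directions, H, W)
```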
Recognition accuracy:

    Dataset        Dollar et al. '05    model    chance
    KTH Human      81.3%                91.6%    16.7%
    Weiz. Human    86.7%                96.3%    11.1%
    UCSD Mice      75.6%                79.0%    20.0%

★ Cross-validation: 2/3 training, 1/3 testing, 10 repeats.    Jhuang, Serre, Wolf & Poggio, ICCV'07
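The starred evaluation protocol is easy to restate in code: random 2/3 - 1/3 train-test splits, repeated 10 times, averaging test accuracy. The sketch below uses a nearest-class-mean classifier only as a stand-in for the actual classifier.

```python
import numpy as np

def cross_validate(features, labels, n_repeats=10, train_frac=2/3, seed=0):
    """Random 2/3-1/3 train-test splits, repeated; returns mean accuracy."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    accuracies = []
    for _ in range(n_repeats):
        idx = rng.permutation(n)
        n_train = int(train_frac * n)
        tr, te = idx[:n_train], idx[n_train:]
        # nearest-class-mean classifier as a stand-in
        classes = np.unique(labels[tr])
        means = np.stack([features[tr][labels[tr] == c].mean(0)
                          for c in classes])
        d = ((features[te][:, None, :] - means[None]) ** 2).sum(-1)
        pred = classes[d.argmin(1)]
        accuracies.append((pred == labels[te]).mean())
    return float(np.mean(accuracies))
```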
Automatic recognition of rodent behavior

Performance:

    human agreement      72%
    proposed system      71%
    commercial system    56%
    chance               12%

Serre, Jhuang, Garrote, Poggio & Steele, in prep.
Neuroscience of attention and Bayesian inference
[Figure: an integrated model of attention and recognition: the model hierarchy (V2 → V4/PIT → IT → PFC) with feature-based attention fed back from PFC and spatial attention from LIP/FEF.]

In collaboration with the Desimone lab (monkey electrophysiology).
[Figure: the same circuit mapped onto a Bayesian graphical model: object priors (O), location priors (L), intermediate feature variables (Fi, Fli, N) and the image (I); feature-based and spatial attention correspond to inference in this model.]

see also Rao 2005; Lee & Mumford 2003

Chikkerur, Serre & Poggio, in prep.
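In this view, spatial attention can be read out as a posterior over locations. The sketch below is a deliberately reduced version of such a model, using the slide's variable names (O, L, F, I) but an assumed, simplified factorization rather than the exact one in Chikkerur et al.

```python
import numpy as np

def location_posterior(feature_map, location_prior, feature_relevance):
    """Attention as Bayesian inference (simplified sketch):
    P(L | I, O) ∝ P(I | L, O) P(L), over a discrete grid of locations L.
    feature_map:       (H, W) feature-detector responses, read as P(I | L)
    location_prior:    (H, W) prior over locations, P(L)
    feature_relevance: scalar weight of this feature under the object O."""
    likelihood = feature_relevance * feature_map   # P(I | L, O), up to scale
    unnormalized = likelihood * location_prior     # Bayes-rule numerator
    return unnormalized / unnormalized.sum()       # posterior over locations
```

Feature-based attention then corresponds to weighting feature channels by their relevance to O, and spatial attention to the resulting posterior over L.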
Model predicts human eye movements well

Integrating (local) feature-based and (global) context-based cues accounts for 92% of inter-subject agreement.

Chikkerur, Tan, Serre & Poggio, in submission.
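A minimal sketch of the combination named above, treating the local feature-based map and the global context-based map as probability maps over image locations; the multiplicative rule is an assumption for illustration.

```python
import numpy as np

def fixation_prediction(feature_map, context_map):
    """Combine a (local) feature-based saliency map with a (global)
    context-based map into one normalized fixation-probability map."""
    combined = feature_map * context_map
    return combined / combined.sum()
```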
Model performance improves with attention

[Figure: performance (d'), no attention vs. one shift of attention, for the model and for human observers (mask / no-mask conditions).]

Chikkerur, Serre & Poggio, in prep.
Main Achievements in Neo2

Extended and extensively tested the feedforward model on real-world recognition tasks [Poggio]:
- matches neural data
- mimics human performance in rapid categorization
- performs at the level of state-of-the-art computer vision systems
- C++ software and interface available; 100x speed-up
- combined with a saliency algorithm and tested on real-time street surveillance (video)

Demonstrated read-out of cluttered natural images from monkey fMRI and physiology recordings in inferotemporal cortex [Freiwald and Poggio]:
- first decoding of cluttered complex images
- agreement with the original feedforward model

Characterized neural encoding in V4, IT and FEF under passive and task-dependent viewing conditions [Desimone and Poggio]:
- characterized the dynamics of bottom-up vs. top-down visual information processing (characteristic timing signature of activity in V4 and IT vs. FEF)
- top-down, task-dependent attention modulates features in V4 and IT
Main Achievements in Neo2 (continued)

Implemented a new, extended model, suggested by the neuroscience data from the Desimone lab, that includes attention via feedback loops from higher areas [Poggio]:
- predicts human gaze in natural images well
- significantly improves the recognition performance of the original model in clutter

Extended the model to classification of video sequences (i.e., action recognition) [Poggio]:
- tested on several video databases and shown to outperform previous algorithms

Demonstrated read-out from the human medial temporal lobe (MTL) [Koch]:
- decoding of natural scenes from single neurons in human MTL
- improved ability of the saliency model to mimic human gaze patterns

The model was used to transfer neuroscience data to biologically inspired vision systems.
Future Directions

MIT team: Poggio, Desimone, Serre, 1-of-2 IT physiologist, + (Koch+Itti)

Develop new technologies to decode computations and representations in the visual cortex:
- circuits: optical silencing and stimulation technology based on X-rhodopsin
- network: multi-electrode technology
- system: simultaneous recordings across areas
From the neuroscience data towards a system-level model of natural vision

MIT team: Poggio, Desimone, Serre, XXX

1. Clutter and image ambiguities: attention and cortical feedback
2. Learning and recognition of objects in video sequences
Clutter and image ambiguities: Attention and cortical feedback
   Circuitry of attention and the role of synchronization in top-down and bottom-up search tasks: monkey electrophysiology in V4, IT and FEF
Learning and recognition of objects in video sequences
   How current computer vision systems learn vs. how brains learn
Thank you!
Past Neo2 team:
CalTech, Bremen & MIT
Tomaso Poggio, MIT
Bob Desimone, MIT
Christof Koch, CalTech
Winrich Freiwald, Bremen
IT readout improves with attention
   [Figure: decoding from an IT population (n=67) across the trial timeline (stim, cue, transient change) for four conditions: isolated object, attention on object, attention away from object, object not shown]
Zhang, Meyers, Serre, Bichot, Desimone & Poggio, in prep
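The "readout" here is population decoding: a classifier is trained on the firing rates of the recorded IT neurons and tested on held-out trials. The slide does not specify the decoder, so the sketch below uses a maximum-correlation-coefficient classifier on synthetic data, a common choice in this readout literature; the population size of 67, the two categories, and the modeling of attention as a signal-to-noise boost are illustrative assumptions, not the study's actual analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_correlation_decode(train_X, train_y, test_X):
    """Correlate each test population vector with each class-mean template
    and pick the best-matching class (maximum-correlation classifier)."""
    classes = np.unique(train_y)
    templates = np.stack([train_X[train_y == c].mean(axis=0) for c in classes])
    tz = (test_X - test_X.mean(1, keepdims=True)) / test_X.std(1, keepdims=True)
    cz = (templates - templates.mean(1, keepdims=True)) / templates.std(1, keepdims=True)
    corr = tz @ cz.T / test_X.shape[1]          # Pearson r, tests x classes
    return classes[corr.argmax(axis=1)]

# Synthetic "population of 67 IT neurons", two object categories per trial.
# Attention is modeled, purely for illustration, as a higher signal-to-noise
# ratio in the population response.
def simulate(snr, n_neurons=67, n_trials=200):
    patterns = rng.normal(size=(2, n_neurons))  # class-specific rate patterns
    y = rng.integers(0, 2, n_trials)
    X = snr * patterns[y] + rng.normal(size=(n_trials, n_neurons))
    return X, y

for condition, snr in [("attention away from object", 0.15),
                       ("attention on object",        0.45)]:
    X, y = simulate(snr)
    half = len(y) // 2                          # first half train, second test
    pred = max_correlation_decode(X[:half], y[:half], X[half:])
    print(f"{condition}: decoding accuracy = {(pred == y[half:]).mean():.2f}")
```

With the higher assumed signal-to-noise ratio, held-out decoding accuracy rises, which is the qualitative pattern the slide reports.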
Two functional classes of cells to explain invariant object recognition in the visual cortex
   Simple cells: template matching; Gaussian-like tuning (~ "AND")
   Complex cells: invariance; max-like operation (~ "OR")
Riesenhuber & Poggio 1999 (building on Fukushima 1980 and Hubel & Wiesel 1962)
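The two operations on this slide map directly onto two lines of code: an S ("simple") unit computes a Gaussian-like match between its input and a stored template, and a C ("complex") unit takes a max over S units that share the template but sit at different positions or scales (Riesenhuber & Poggio 1999). A minimal sketch, with illustrative dimensions and noise levels rather than the model's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
template = rng.normal(size=8)   # the feature one type of S unit is tuned to

def simple_unit(x, template, sigma=1.0):
    """Gaussian tuning around a stored template (~'AND' over subunits):
    maximal response only when the input matches the template."""
    return np.exp(-np.sum((x - template) ** 2) / (2 * sigma ** 2))

def complex_unit(s_responses):
    """Max pooling over S units sharing a template across positions
    (~'OR'): keeps the selectivity, gains the invariance."""
    return max(s_responses)

# Slide the feature across 4 "retinal positions"; 4 S units with the same
# template tile those positions. Whichever S unit is aligned fires, and the
# C unit's max stays high regardless of position (position invariance).
for shift in range(4):
    windows = [np.roll(template, k - shift) + 0.05 * rng.normal(size=8)
               for k in range(4)]           # what each S unit's RF sees
    s = [simple_unit(w, template) for w in windows]
    print(f"shift={shift}  S={np.round(s, 2)}  C={complex_unit(s):.2f}")
```

At every shift exactly one S unit is well aligned and fires, so the C response stays high: selectivity is preserved by the template match, and invariance comes from the pooling.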
Editor's Notes
  1. Here is the team that I am representing: Tomaso Poggio and Bob Desimone at MIT, Christof Koch at CalTech, and Winrich Freiwald, who used to be in Bremen, is now at CalTech, and will soon be at Rockefeller.
  2. Our group has been focusing on the computational mechanisms of invariant object recognition. This is obviously a very hard computational problem, and despite decades of engineering effort we still have not been able to build a computer algorithm that can compete with the speed, robustness and efficiency of the primate visual system. Our long-term goal is thus to build machines that not only mimic the processing of information in the visual cortex but also see and interpret the visual world as well as we do.
  3. Over the years we have developed an initial quantitative model of information processing in the visual cortex. The model tries to summarize what is currently known about the anatomy, physiology and organization of the visual cortex. It does not try to explain the processing of information in one specific visual area but instead spans several visual areas with a relatively large number of units (on the order of 100 million). The model combines reverse engineering, in which parameters such as RF sizes are derived from available data, with forward engineering, as it is inspired by well-known principles from learning theory and computer vision. Together with colleagues, we have shown that the resulting architecture is surprisingly consistent with data from V1, V2, V4, MT and IT.
  4. Unfortunately I am not going to have much time to give you details about this model; I would be happy to talk afterwards if anyone has questions. The key assumption is that when the visual system is flashed with an image, the visual signal is rapidly routed through a hierarchy of visual areas in a single feedforward sweep. The goal of the ventral stream, on this view, is to build during the first 150 ms of visual processing a base representation, whereby object categories can be represented in a position- and scale-tolerant manner before more complex routines, in particular shifts of attention and eye movements, take place. This base representation takes the form of a population of model units in various stages of the hierarchy tuned to key features of natural images with different levels of complexity and invariance. Learning in the model of the ventral stream is unsupervised, so that when training the model to recognize a new object category we do not have to retrain the whole hierarchy, only the task-specific circuits that sit at the top, for instance in the PFC; you can think of these task-specific circuits as a linear classifier.
  5. Let me show you one example of the validation we have performed on this model. Here we considered a small population of about 200 random model units in one of the top stages of the architecture I just presented. From this population activity we can try to read out the object category of stimuli presented to the model. In fact we can train a classifier with stimuli presented at one position and scale and see how well it generalizes to other positions and scales; this tells you how much invariance is built into the population of units. We get the results indicated by the light gray bar plots, corresponding to different amounts of shift in position and scale. You can play the same game on neurons in IT, which is the highest purely visual area and has been critically linked with primates' ability to recognize objects invariant of position and scale. We found that the model was able to predict not only the overall level of performance but also the range of invariance to position and scale.
  6. Another important validation is behavior, assessed here using human psychophysics. As I mentioned earlier, the original goal of the model was not to explain natural everyday vision, when you are free to move your eyes and shift your attention, but rather what is often called rapid or immediate recognition, which corresponds to the first 100-150 ms of visual processing (when an image is briefly presented), i.e., when the visual system is forced to operate in a feedforward mode before eye movements and shifts of attention take place. An example is shown on the left: I flash an image for a few milliseconds; you probably don't have time to take in every fine detail, but most people can say whether it contains an animal or not. We divided our dataset into 4 subcategories: head, close-body, medium-body and far-body. Overall, both the model and humans score about 80% on this very difficult task, and they agree quite well in terms of how they perform across these 4 subcategories.
  7. This dependency of human and model performance on clutter motivated a subsequent electrophysiology experiment done with Winrich Freiwald during the Neo2 project, where we found that the trend still holds for neurons in monkey IT cortex. We used fMRI to find areas differentially selective for animal vs. non-animal images, and Winrich recorded from a small population of about 200 neurons in this area. You can see the readout results on the right: we could reliably read out the animal category from these difficult real-world images. Interestingly, there was also a surprisingly high signal at the BOLD level (using a contrast agent).
  8. More recently we gained access to a population of patients with intractable epilepsy who are planned for resective surgery. Typically the patients spend about a week at the hospital with implanted electrodes, monitored 24/7 to essentially triangulate the epileptic site. These patients are a unique opportunity to get not only behavioral measurements but also simultaneous intracranial recordings (here we measure local field potentials from iEEG). I should emphasize that the spatial and temporal resolution we get is several orders of magnitude higher than what we could get with a non-invasive imaging technique such as fMRI. As an illustration, here is one electrode from one patient performing this animal vs. non-animal categorization task; the electrode location has to be confirmed but is probably somewhere around the temporal lobe. Already around 145 ms one can read out the presence or absence of an animal presented to the patient.
  9. Of course one key limitation of this approach is that we have no control over the location of the electrodes, which is based solely on medical criteria. However, by pooling together data from multiple patients we hope to be able to reconstruct the feedforward sweep and recover readout latencies across the temporal lobe.
  10. In parallel we have used this model in real-world computer vision applications. For instance, we have developed a computer vision system for the automatic parsing of street-scene images. Here are examples of automatic parsing by the system overlaid on the original images; the colors and bounding boxes indicate predictions from the model (e.g., green for trees).
  11. We have made a number of improvements to the implementation of this model. The original MATLAB implementation was quite slow, so we have worked on several ways to speed it up: we started with an efficient multi-threaded C/C++ implementation and then exploited the recent gains in computational power from graphics processing hardware (GPUs).
  12. More recently we have extended the approach to the recognition of human actions such as running, walking, jogging, jumping and waving. In all cases we have shown that the resulting biologically motivated computer vision systems perform on par with or better than state-of-the-art computer vision systems.
  13. There are several other systems that
  14. Let me switch gears and tell you a little bit about our work on attention. As I showed you earlier, one key limitation of this feedforward architecture is that it performs well only when the object to be recognized is large and the amount of background clutter is limited; consistent with human psychophysics and monkey electrophysiology, the model's performance decreases quite significantly as clutter increases. Our working hypothesis is that the visual system overcomes this limitation via cortical feedback and shifts of attention; in particular, the role of spatial attention is to suppress the clutter so that the object of interest appears as if it were presented in isolation. In collaboration with electrophysiology labs we are studying the circuits and networks of visual areas involved in attention.
  15. In collaboration with electrophysiology labs we are studying the circuits and networks of visual areas involved in attention, which involve a complex interaction between the ventral stream (area V4 in particular), prefrontal areas such as the FEF, and the parietal cortex.
  16. We had to make two key extensions to this model. First, we assume that feature-based attention acts through a cascade of top-down connections through the ventral stream, originating in the PFC, where a template of the target object is held in memory, all the way down to V4 and possibly lower areas. Second, we assume a spatial attention modulation originating from the parietal cortex (here I am assuming LIP, based on limited experimental evidence).
  17. This attentional mechanism can be cast in a probabilistic Bayesian framework, whereby the parietal cortex represents location variables and the ventral stream represents feature variables (these are our image fragments); variables for the target object are encoded in higher areas such as the PFC. This framework is inspired by an earlier model by Rao to explain spatial attention and is a special case of the computational model of the visual cortex described by David Mumford that most of you probably know. (A toy numerical sketch of this inference appears after these notes.)
  18. We have implemented the approach in the context of our animal detection task; the performance of the model increases with only one shift of attention. Here is the performance of the feedforward model as shown earlier, averaged across all categories, and here is the performance allowing one shift of attention. For comparison, here is the performance of human observers when images are flashed very briefly, and here is their performance when given a little more time, presumably just enough to allow one shift of attention. Our long-term goal is obviously to match human performance when observers are given as much time as needed.
  45. Let me just summarize some of our main achievements from phase 0 of Neo2.
  46. Let me just summarize some of our main achievements from phase 0 of Neo2.
  47. Let me just summarize some of our main achievements from phase 0 of Neo2.
48-50. If we want to make real progress in deciphering the computations and representations of the visual cortex, we need to study brains not just at the level of single neurons; we need to integrate multiple levels of analysis. In particular, we need to: 1) understand how key computations for object recognition are carried out in cortical microcircuits; here we have been developing new tools, based on channelrhodopsin, for optically silencing and stimulating neurons in order to study these circuits; 2) understand the interactions between networks of neurons within single cortical areas, which will require developing multi-electrode technologies not only in lower visual areas, as is currently done, but also in higher visual areas that are more difficult to access; and 3) record from not just one area at a time but from multiple areas, to understand how these areas communicate with each other.
51. At the same time, these neuroscience data will allow us not only to validate but also to extend existing models of the visual cortex, and hopefully to improve their recognition capabilities. If we want computer systems that can compete with the primate visual system, we need to go beyond rapid categorization tasks and study vision in more natural settings. I think there are two key neuroscience questions to be studied. First, as I alluded to already in this talk, cortical feedback and shifts of attention are likely the key computational mechanisms by which the visual system overcomes most of the difficulties inherent to vision, namely dealing with significant amounts of clutter as well as ambiguity in the visual input due to occlusion or a low signal-to-noise ratio. The second is the processing of image sequences, not as a succession of independent snapshots, as in the model of rapid object categorization I showed you, but with models that can exploit the temporal continuity of image sequences, both for learning invariance to transformations (zooming and looming, translation, rotation in depth, etc.) and for the recognition of objects in motion.
52. Along those lines, we have started to make significant progress in understanding the circuitry of attention, and in particular how spatial attention works to suppress clutter in image displays of this kind.
  53. The next step is obviously to move towards more natural stimulus presentations.
54. I think significant progress in computer vision will come from the use of video sequences and the exploitation of temporal continuity within them. Current computer vision systems treat the visual world as a collection of independent frames. The visual world is much richer than that, and time is an important component of visual perception. Babies certainly do not learn to recognize giraffes from labeled examples of this kind. Instead, a baby going to the zoo, perhaps for the first time, has access to much richer information, whereby giraffes undergo transformations such as rotation in depth, looming, or shifting on the retina in a smooth, continuous way. It is our belief that by exploiting these principles we will be able to build better learning algorithms.
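One classical way to exploit this temporal continuity is a trace rule in the spirit of Földiák (1991): the learning signal is a slowly decaying average of the unit's recent activity, so successive views of the same transforming object get bound to the same unit. The sketch below is a toy illustration of that general idea, not the model discussed in this talk; the learning rate, trace decay, and input shapes are all assumptions.

```python
import numpy as np

def trace_rule_update(w, frames, eta=0.05, decay=0.8):
    """Hebbian learning gated by an activity trace.

    Because the trace decays slowly, inputs that follow each other in
    time (successive views of the same object) reinforce the same weight
    vector, so the unit becomes tolerant to the transformation linking
    those views. Parameter values here are illustrative assumptions.
    """
    trace = 0.0
    for x in frames:
        y = float(w @ x)                   # instantaneous response
        trace = decay * trace + (1 - decay) * y
        w += eta * trace * x               # Hebbian update gated by the trace
        w /= np.linalg.norm(w)             # keep the weights bounded
    return w

# Toy sequence: the "same object" drifting smoothly across a 1-D retina.
rng = np.random.default_rng(1)
frames = [np.roll(np.array([1.0, 0.8, 0.2, 0.0, 0.0, 0.0]), k) for k in range(3)]
w = rng.random(6)
w /= np.linalg.norm(w)
print("learned weights:", np.round(trace_rule_update(w, frames), 2))
```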
55. Most of the work in computer vision and visual neuroscience has focused on the recognition of isolated objects. However, vision is much more than classification: it involves interpreting, parsing, and navigating visual scenes. Just by looking, a human observer can answer an essentially unlimited number of questions about an image, for instance about the location and boundary of an object, how to grasp it, or how to navigate over it. These are essential problems for robotics applications, yet they have remained largely unaddressed in neuroscience.
56. Here is the team I am representing: Tomaso Poggio and Bob Desimone at MIT, Christof Koch at CalTech, and Winrich Freiwald, formerly at Bremen, now at CalTech and soon at Rockefeller.
57-59. We have implemented the approach in the context of our animal search task; the model improves mostly in the medium and far conditions.
60. Computational considerations suggest that invariant object recognition requires two types of operations, and therefore two functional classes of cells. The Gaussian-bell tuning was motivated by a learning algorithm based on Radial Basis Functions, while the max operation was motivated by the standard scanning approach in computer vision and by theoretical arguments from signal processing. The goal of the simple units is to increase the complexity of the representation; in this example, by pooling together the activity of afferent units with different orientations via Gaussian-like tuning. Such Gaussian tuning is ubiquitous in the visual cortex, from orientation tuning in V1 to tuning for complex objects around certain poses in IT. The complex units pool together afferent units with the same preferred stimulus (e.g., a vertical bar) but slightly different positions and scales. At the complex-unit level, we thus build some tolerance with respect to the exact position and scale of the stimulus within the receptive field of the unit.
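To make these two operations concrete, here is a minimal sketch in Python/NumPy of the Gaussian-like tuning of the simple units and the max pooling of the complex units. The array sizes, the sigma value, and the function names are illustrative assumptions, not the actual model code.

```python
import numpy as np

def simple_unit_response(afferents, prototype, sigma=1.0):
    """Gaussian-like (RBF) tuning: the response peaks when the pattern of
    afferent activity matches the unit's stored prototype and falls off
    with the squared Euclidean distance from it."""
    d2 = np.sum((afferents - prototype) ** 2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def complex_unit_response(responses):
    """Max pooling: take the strongest response among afferents with the
    same preferred stimulus (e.g., a vertical bar) but slightly different
    positions and scales, yielding tolerance to position and scale."""
    return np.max(responses)

# Toy example: one simple-unit prototype over 4 orientation channels,
# probed at 9 nearby positions/scales, then pooled by a complex unit.
rng = np.random.default_rng(0)
prototype = rng.random(4)            # learned pattern of afferent activity
patches = rng.random((9, 4))         # the same feature at 9 positions/scales
s = [simple_unit_response(p, prototype, sigma=0.5) for p in patches]
print(f"complex unit output: {complex_unit_response(s):.3f}")
```

In the model, the Gaussian tuning increases the selectivity of the representation while the max operation builds invariance, and the two operations are stacked in alternation through the hierarchy.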