CSAIL Publications and Digital Archive


Research Abstracts - 2006

A comparison between a feedforward model of the ventral stream of visual cortex and human observers in a rapid categorization task

Thomas Serre & Tomaso Poggio


We have developed a quantitative theory to account for the computations performed by the feedforward path of the ventral stream of visual cortex and the local circuits implementing them [1,2]. The theory, which extends the Hubel and Wiesel hierarchical model [3] as well as several other proposals, is qualitatively and quantitatively consistent with (and in many cases predicts) several properties of cells in V1, V2, V4, and IT, as well as results from fMRI and psychophysical experiments (see [1] for an overview). Here we show that a model instantiating the theory predicts the level of performance of human observers on a complex categorization task.

Figure 1: Model overview.

The theory summarizes a set of basic facts about cortical mechanisms of recognition established over the last decade by several physiological studies in non-human primates. Object recognition in cortex is thought to be mediated by the ventral visual pathway running from primary visual cortex, V1, through extrastriate visual areas V2 and V4 to inferotemporal cortex, IT (see Fig. 1). Starting from simple cells in V1 with small receptive fields that respond preferentially to oriented bars, neurons along the ventral stream show a gradual increase in receptive field size as well as in the complexity of their preferred stimuli. At the top of the ventral stream, in anterior inferotemporal cortex (AIT), cells are tuned to complex stimuli such as faces and other objects. The tuning of the view-tuned and object-tuned cells in AIT very likely depends on visual experience.

In brief, the accumulated evidence points to a visual system in which a) visual objects can be identified and categorized within a time too short to allow for recursions between areas, b) plasticity is present at the stage of IT and probably also at earlier stages, c) the tuning of IT cells is obtained through a hierarchy of cortical stages involving cells tuned to simpler features, d) the basic ability to generalize depends on the combination of cells tuned by visual experience, and e) a specific pooling operation provides invariance to image-based transformations.
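Points c) and e) can be illustrated with a minimal sketch of the two operations. It assumes Gaussian-like tuning and a max operation for pooling, as in the model of Riesenhuber & Poggio [4]; the text above does not name the pooling operation, and the function names, the 1-D signal, and the value of sigma are illustrative choices, not part of the theory as stated here.

```python
import math

def tuning(afferents, template, sigma=1.0):
    """Gaussian-like tuning (assumed form): the response is 1.0 when the
    afferent pattern equals the stored template and decays with distance."""
    dist2 = sum((a - t) ** 2 for a, t in zip(afferents, template))
    return math.exp(-dist2 / (2.0 * sigma ** 2))

def max_pool(responses):
    """Pooling (assumed to be a max): keep the strongest response
    over a set of positions or scales."""
    return max(responses)

def unit_response(signal, template, sigma=1.0):
    """Apply the tuned unit at every position of a 1-D signal,
    then pool the responses over position."""
    n = len(template)
    windows = [signal[i:i + n] for i in range(len(signal) - n + 1)]
    return max_pool(tuning(w, template, sigma) for w in windows)

# Hypothetical toy input: the same pattern at two different positions.
template = [0.0, 1.0, 0.0]
signal = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
shifted = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]

r_original = unit_response(signal, template)
r_shifted = unit_response(shifted, template)
r_blank = unit_response([0.0] * 6, template)
```

In this toy case the pooled response is the same for the original and the shifted pattern, while a blank input gives a weaker response. This is the sense in which tuning supplies selectivity and pooling supplies invariance.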

The Approach

The theory is a significant extension of a model introduced by Riesenhuber & Poggio [4]: it now includes unsupervised learning from natural images to account for the tuning properties of neurons from V2 to IT. During this learning stage, which can be regarded as an imprinting process, units learn from natural images by storing in their synaptic weights the specific pattern of feedforward afferent activity. We conjecture from our simulations that the resulting large number of tuned units constitutes a universal and redundant dictionary of image features that show a range of selectivities for image patterns and of invariances to translation and scale, and that support the recognition of many different object categories [5,2]. When tested on real-world natural images, the model performs at least at the level of some of the best computer vision systems on several categorization tasks.
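The imprinting step can be sketched as copying patches of afferent activity into unit weights. The sketch below assumes that patches are taken at random positions from random images; the text only says that units store the pattern of afferent activity. The function name `imprint`, the patch size, and the random toy maps standing in for natural-image responses are all hypothetical.

```python
import random

def imprint(feature_maps, patch_size, n_units, rng):
    """Build a dictionary of units by copying patches of afferent activity.

    feature_maps: list of 2-D lists (rows of floats), one per image.
    Each returned unit is a patch_size x patch_size weight template.
    Random choice of image and position is an assumption of this sketch.
    """
    units = []
    for _ in range(n_units):
        fmap = rng.choice(feature_maps)
        rows, cols = len(fmap), len(fmap[0])
        r = rng.randrange(rows - patch_size + 1)
        c = rng.randrange(cols - patch_size + 1)
        # Store the afferent activity pattern directly as the unit's weights.
        patch = [row[c:c + patch_size] for row in fmap[r:r + patch_size]]
        units.append(patch)
    return units

rng = random.Random(0)
# Toy stand-ins for afferent activity maps; not responses to real images.
maps = [[[rng.random() for _ in range(8)] for _ in range(8)] for _ in range(3)]
dictionary = imprint(maps, patch_size=3, n_units=5, rng=rng)
```

No labels are used at any point, which is what makes the stage unsupervised. A stored patch could then serve as the template of a tuned unit.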

Table 1: Comparison between the model and some of the best computer vision systems on several standard datasets.

Predicting human performance in a rapid categorization task (with Aude Oliva, MIT BCS)

We show that a model instantiating the theory is capable of performing recognition on datasets of complex images at the level of human observers. The task we use is a rapid animal vs. non-animal recognition task [6], paired with a backward mask protocol [7] that interrupts visual processing and blocks back-projections. The stimulus is presented for 20 ms, followed by a blank interval of 30 ms before the mask (1/f noise) appears for 80 ms, which gives a stimulus onset asynchrony (SOA) of 50 ms.
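The timing of one trial can be laid out as a short calculation. The durations are taken from the text; the SOA is computed here as the time from image onset to mask onset, which is the usual definition, and the function and variable names are illustrative.

```python
STIMULUS_MS = 20  # image presentation (from the text)
ISI_MS = 30       # blank interval between image offset and mask onset (from the text)
MASK_MS = 80      # 1/f noise mask (from the text)

def trial_timeline(stimulus_ms, isi_ms, mask_ms):
    """Return (event, onset_ms, offset_ms) tuples for one masked trial."""
    events = []
    t = 0
    for name, duration in (("stimulus", stimulus_ms),
                           ("blank", isi_ms),
                           ("mask", mask_ms)):
        events.append((name, t, t + duration))
        t += duration
    return events

timeline = trial_timeline(STIMULUS_MS, ISI_MS, MASK_MS)
onsets = {name: onset for name, onset, _ in timeline}

# Stimulus onset asynchrony: image onset to mask onset.
soa_ms = onsets["mask"] - onsets["stimulus"]
trial_ms = timeline[-1][2]
```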

Figure 2: The task.

Results indicate that for this SOA the model mimics the level of performance of human observers very well. Additionally, we show that when the stimuli are rearranged into subcategories (reflecting the amount of clutter in the image and thus the level of difficulty), the model successfully predicts human performance on individual subcategories. The model seems to be able to predict (to some extent) images that are more likely to be misclassified by human observers. The implication is that, for this SOA, the back-projections do not play a significant role and the model provides a satisfactory description of the feedforward path.
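One way such a subcategory comparison could be computed is sketched below: accuracy per subcategory for each observer type, followed by a Pearson correlation across subcategories. The choice of measure is an assumption of this sketch, not necessarily the analysis used in the study, and the subcategory labels and trial outcomes are synthetic placeholders, not data from the experiment.

```python
import math

def accuracy_by_subcategory(trials):
    """trials: iterable of (subcategory, is_correct) pairs.
    Returns {subcategory: fraction of correct trials}."""
    totals, hits = {}, {}
    for subcat, correct in trials:
        totals[subcat] = totals.get(subcat, 0) + 1
        hits[subcat] = hits.get(subcat, 0) + (1 if correct else 0)
    return {s: hits[s] / totals[s] for s in totals}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Synthetic placeholder trials (labels "A", "B", "C" are hypothetical).
model_trials = [("A", True), ("A", True), ("B", True),
                ("B", False), ("C", False), ("C", False)]
human_trials = [("A", True), ("A", True), ("B", True),
                ("B", False), ("C", True), ("C", False)]

model_acc = accuracy_by_subcategory(model_trials)
human_acc = accuracy_by_subcategory(human_trials)
subcats = sorted(model_acc)
r = pearson([model_acc[s] for s in subcats],
            [human_acc[s] for s in subcats])
```

The same two functions apply at the level of individual images if each image is treated as its own group, which corresponds to asking whether the model and human observers tend to misclassify the same images.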

Figure 3: Comparison between the model and human observers (n=24).

Finally, we show that image rotation (90 and 180 deg) influences the performance of both the model and human observers in a similar manner.

Figure 4: Comparison between the model and human observers (n=14) for different image orientations.

Conclusion and future work

A hierarchical architecture, reflecting the known physiology and anatomy of visual cortex, achieves a high level of correlation with humans – and comparable accuracy – on a difficult (but rapid) recognition task. This suggests that we may be close to understanding the main outline of the computations performed during immediate recognition in the feedforward path of the ventral stream. Future work will focus on visual processing after this first feedforward sweep and, in particular, try to understand the nature of the processing by the back-projections found across visual cortex.


This report describes research done at the Center for Biological & Computational Learning, which is in the McGovern Institute for Brain Research at MIT, as well as in the Dept. of Brain & Cognitive Sciences, and which is affiliated with the Computer Science & Artificial Intelligence Laboratory (CSAIL).

This research was sponsored by grants from: Office of Naval Research (DARPA) Contract No. MDA972-04-1-0037, Office of Naval Research (DARPA) Contract No. N00014-02-1-0915, National Science Foundation (ITR/SYS) Contract No. IIS-0112991, National Science Foundation (ITR) Contract No. IIS-0209289, National Science Foundation-NIH (CRCNS) Contract No. EIA-0218693, National Science Foundation-NIH (CRCNS) Contract No. EIA-0218506, and National Institutes of Health (Conte) Contract No. 1 P20 MH66239-01A1. Additional support was provided by: Central Research Institute of Electric Power Industry (CRIEPI), Daimler-Chrysler AG, Compaq/Digital Equipment Corporation, Eastman Kodak Company, Honda R&D Co., Ltd., Industrial Technology Research Institute (ITRI), Komatsu Ltd., Eugene McDermott Foundation, Merrill-Lynch, NEC Fund, Oxygen, Siemens Corporate Research, Inc., Sony, Sumitomo Metal Industries, and Toyota Motor Corporation.


[1] T. Serre, M. Kouh, C. Cadieu, G. Kreiman and T. Poggio. A theory of object recognition: computations and circuits in the feedforward path of the ventral stream in primate visual cortex. CBCL Paper #259/AI Memo #2005-036, Massachusetts Institute of Technology, Cambridge, MA, December, 2005.

[2] T. Serre. Learning a dictionary of shape-components in visual cortex: Comparison with neurons, humans and machines. PhD Thesis, Massachusetts Institute of Technology, Cambridge, MA, 2006.

[3] D. H. Hubel and T. N. Wiesel. Receptive fields and functional architecture in two nonstriate visual areas (18 and 19) of the cat. J. Neurophys., 28:229–289, 1965.

[4] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nat. Neurosci., 2:1019–1025, 1999.

[5] T. Serre, L. Wolf, and T. Poggio. Object recognition with features inspired by visual cortex. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, San Diego, 2005.

[6] S.J. Thorpe, D. Fize, and C. Marlot. Speed of processing in the human visual system. Nature, 381:520–522, 1996.

[7] N. Bacon-Mace, M.J. Mace, M. Fabre-Thorpe, and S.J. Thorpe. The time course of visual processing: backward masking and natural scene categorization. Vis. Res., 45:1459–1469, 2005.



Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu