CSAIL Research Abstracts - 2005

Learning a Dictionary of Shape-Components in Visual Cortex

Thomas Serre & Tomaso Poggio

The Problem

The tuning properties of neurons in the ventral stream of visual cortex, from primary visual cortex (V1) to inferotemporal cortex (IT), are likely to play a key role in visual perception in primates, and in particular in their object recognition abilities.  The tuning of specific neurons probably depends, at least in part, on visual experience.  We describe a model of visual plasticity and learning from V4 to IT, extending the initial version of the standard model of object recognition in primate cortex, that accounts for known physiological data and invariance properties of neurons in IT.  When exposed to many natural images, the model generates a large set of shape-tuned units which can be interpreted as a universal (redundant) dictionary of shape-components with the properties of over-completeness and non-uniqueness.  When tested on real-world natural images, the model outperforms the best computer vision systems on several different recognition tasks. We also show that the tuning properties of the set of shape-tuned units generated by the model are compatible with those of cortical neurons.

Background

The standard model aims to summarize a set of basic facts about cortical mechanisms of recognition established over the last decade by several physiological studies in non-human primates.  The model is a quantitative extension of the classical feed-forward, hierarchical architecture originally proposed by Hubel & Wiesel to build complex cells from simple cells in primary visual cortex (Hubel and Wiesel, 1962, 1965).  Object recognition in cortex is thought to be mediated by the ventral visual pathway, running from primary visual cortex, V1, over extrastriate visual areas V2 and V4 to inferotemporal cortex, IT.  Starting from simple cells in V1, with small receptive fields that respond preferentially to oriented bars, neurons along the ventral stream show a gradual increase in receptive field size as well as in the complexity of their preferred stimuli.  At the top of the ventral stream, in anterior inferotemporal cortex (AIT), cells are tuned to complex stimuli such as faces and other objects.  The tuning of the view-tuned and object-tuned cells in AIT very likely depends on visual experience (Jagadeesh et al., 2001; Kobatake et al., 1998; Logothetis et al., 1995; Sakai et al., 1994). More recent studies also suggest adult plasticity below IT, in V4 (Rainer et al., 2004; Yang and Maunsell, 2004) and even V1 (Schoups et al., 2004).

In brief, the accumulated evidence points to a visual system that a) can identify and categorize visual objects within a time too short to allow for recursion between areas, and in which b) plasticity is present at the stage of IT and probably also at earlier stages, c) the tuning of IT cells is obtained through a hierarchy of cortical stages involving cells tuned to simpler features, d) the basic ability to generalize depends on the combination of cells tuned by visual experience, and e) a specific pooling operation provides invariance to image-based transformations.

Methods

The standard model relies on two key computational mechanisms: tuning by the simple S units, which builds object selectivity, and a softmax operation by the complex C units, which provides invariance to object transformations.  In the model, the two kinds of layers alternate.  Simple S units take their inputs from neighboring units tuned to different preferred stimuli and combine them with a Gaussian-like tuning function, thus increasing object selectivity; complex C units pool over units tuned to the same preferred stimulus but at slightly different positions and scales through a softmax operation, thereby introducing tolerance to scale and translation.  This gradual increase in both selectivity and invariance is critical to avoid both 1) a combinatorial explosion in the number of units and 2) the binding problem.
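
The two operations can be sketched in a few lines of NumPy. This is only an illustration of the computations just described, not the authors' implementation; the function names and the parameters (prototype w, tuning width sigma, pooling sharpness q) are ours.

```python
import numpy as np

def s_unit(x, w, sigma=1.0):
    """Simple (S) unit: Gaussian-like tuning around a stored prototype w.
    The response peaks when the afferent activity pattern x matches w,
    which is what builds object selectivity."""
    return float(np.exp(-np.sum((x - w) ** 2) / (2.0 * sigma ** 2)))

def c_unit(s_responses, q=4.0):
    """Complex (C) unit: softmax pooling over S units that share the same
    preferred stimulus but sit at slightly different positions and scales.
    As q grows the operation approaches a hard max, giving tolerance to
    translation and scale."""
    s = np.asarray(s_responses, dtype=float)
    e = np.exp(q * s)
    return float(np.sum(s * e) / np.sum(e))

# Toy example: the preferred feature appears at one of three positions the
# C unit pools over; the softmax keeps the C response near the best S
# response, so the unit signals "feature present" regardless of position.
rng = np.random.default_rng(0)
w = rng.random(16)                            # stored prototype
inputs = [w, rng.random(16), rng.random(16)]  # feature present at position 0
responses = [s_unit(x, w, sigma=0.5) for x in inputs]
print(responses, "->", c_unit(responses))     # C response stays close to 1
```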

To account for the effect of visual experience on the tuning of neurons in areas V4 and PIT, instead of the hardwired units of the original model we assume that simple (S2) and complex (C2) cells in V4 become tuned to patches of activity of C1 units (corresponding to complex cells in V1 and V2) that repeat across different images of the same objects. We have already simulated a simplified version of Foldiak's trace rule (Foldiak, 1991) to generate S2 and C2 cells that become tuned to complex features of images. After presentation of many natural images, the units become tuned to complex features, for instance to face components if a sequence of face images (in the presence of background) is presented, even if the objects are not at the same position and scale.  Learning is task-independent and relies on nothing more than temporal continuity (e.g. the same object being present during a temporal sequence of images). The same process is iterated in PIT (S3 and C3 cells), where the neurons now become tuned to patches of activity in V4. We also assume, consistent with available data, that there are direct projections from V2 (roughly corresponding to the S1 and C1 cells of the model) to PIT, generating S2b units with greater selectivity (they are tuned to larger patches with a larger number of subunits than the S2 units in V4).
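
A minimal sketch of such a trace rule, following Foldiak (1991), is given below. This is a generic version for illustration, not the simplified rule actually used in the simulations; alpha (learning rate), delta (trace decay) and sigma (tuning width) are placeholder parameters of ours.

```python
import numpy as np

def imprint_by_trace(patch_sequence, w, alpha=0.1, delta=0.2, sigma=1.0):
    """Foldiak-style trace rule: the unit keeps a low-pass-filtered trace
    of its own activity, so a feature that persists across consecutive
    images (temporal continuity) keeps the trace high and pulls the
    prototype w toward the recurring pattern of C1 activity."""
    trace = 0.0
    for x in patch_sequence:                       # C1 activity patches over time
        y = np.exp(-np.sum((x - w) ** 2) / (2.0 * sigma ** 2))  # current response
        trace = (1.0 - delta) * trace + delta * y  # temporal trace of activity
        w = w + alpha * trace * (x - w)            # Hebbian update gated by trace
    return w

# Illustrative usage: a C1 pattern that recurs (with jitter) across frames
# gets imprinted; a purely random sequence would not move w consistently.
rng = np.random.default_rng(1)
target = rng.random(16)                            # recurring C1 pattern
frames = [target + 0.05 * rng.standard_normal(16) for _ in range(50)]
w = imprint_by_trace(frames, rng.random(16))
print(np.linalg.norm(w - target))                  # w has moved toward the pattern
```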

Results

We ran extensive model simulations on datasets of natural images and compared the results with state-of-the-art computer vision systems. Fig. 2 contains a summary table of the performance of the model on various datasets. For both our system and the benchmarks, we report the error rate at the equilibrium point, i.e. the error rate at which the false positive rate equals the miss rate. The model performs consistently better than the results previously reported on the Caltech datasets (available at www.vision.caltech.edu).  Our system also seems to outperform the component-based system by (Heisele et al., 2002) on the MIT-CBCL face database, as well as a fragment-based system by (Leung, 2004) that uses template-based features with gentle AdaBoost, similar to the system by (Torralba et al., 2004).  We also evaluated the performance of our system on a 101-object dataset, also available at www.vision.caltech.edu, for comparison with (Fei-Fei et al., 2004). Fig. 2 contains sample results from our system on the 101-object database (trained with only 30 positive training examples). The reader can refer to (Serre et al., 2005) for a complete set of results.

Finally, we conducted initial experiments on the multi-class case, again using the 101-object dataset. We split each category into a training set of size 30 and a test set containing the rest of the images, and used a simple multi-class linear SVM as classifier. The SVM applied the all-pairs method for multi-label classification and was trained on 102 labels (101 categories plus the background category, i.e. a 102-alternative forced choice).  We obtained a 42% correct classification rate when using 30 training examples per class, averaged over 10 repetitions (chance below 1%).
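
For concreteness, the equilibrium-point error rate used in the comparisons above can be computed from raw classifier scores as sketched below. This is a generic implementation of ours, assuming real-valued scores and binary labels; it is not taken from the paper's code.

```python
import numpy as np

def equilibrium_error_rate(scores, labels):
    """Sweep the decision threshold until the false-positive rate on the
    negative (background) examples equals the miss rate on the positive
    examples, and report that common error rate."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=int)
    pos, neg = s[y == 1], s[y == 0]
    thresholds = np.unique(s)
    fpr = np.array([np.mean(neg >= t) for t in thresholds])   # false positives
    miss = np.array([np.mean(pos < t) for t in thresholds])   # missed positives
    i = int(np.argmin(np.abs(fpr - miss)))  # threshold where the two rates cross
    return 0.5 * (fpr[i] + miss[i])
```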

We have also shown that the tuning properties of the shape-tuned units generated in the intermediate layers of the model, after passive exposure to many natural images, are compatible with those of cortical neurons in area V4; see Fig. 3 and the abstract by (Cadieu et al., 2005).

Remarks and Predictions
  1. The architecture we described, together with the associated learning rule, creates a redundant dictionary of shape-components in PIT with different degrees of complexity / selectivity and invariance. For instance, PIT neurons receiving direct projections from V2 are tuned to complex features learned from experience and consisting of configurations of several subunits of the V1 type (each one with a limited range of scale and position invariance, similar to complex cells in V1). Projections from V4 to PIT support cells tuned to simpler features but with a larger degree of invariance.
  2. In the present model, task-dependent supervised learning is needed only at cortical levels higher than PIT. We show with simulations that the features learned from quasi-passive visual experience are capable of supporting the recognition of many different objects.
  3. In particular, our results suggest that the same general learning and computational mechanisms could handle the recognition of many objects as well as faces.  This is in contrast with theories speculating that faces are a "special" class of objects requiring different processing strategies.
  4. The model assumes that tuning depends on visual experience.  An obvious prediction is that the properties of tuning - in terms of selectivity and invariance - can be affected by manipulating visual experience.
  5. Since the dictionary of shape-tuned units is overcomplete, it is natural to assume that the specific tuning depends on the particular sequence of visual experience.  We conjecture that similar performance on a recognition task can be achieved with different dictionaries of tuned units, created with somewhat different visual experiences.
  6. Our model is consistent with and extends the notion of features described by Tanaka and colleagues in V4 and PIT.
  7. Our scheme for learning a dictionary of shape-components in the ventral pathway has already been extended to the recognition of motion in the dorsal pathway and shown to be consistent with psychophysics in (Sigala et al., 2005).
References:

[1] Serre et al., Object recognition with features inspired by visual cortex. To appear in Proc. IEEE CVPR, 2005.

[2] Fei-Fei et al., Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In Proc. IEEE CVPR, Workshop on Generative-Model Based Vision, 2004.

[3] Fergus et al., Object class recognition by unsupervised scale-invariant learning. In Proc. IEEE CVPR, 2003.

[4] Foldiak, P., Learning invariance from transformation sequences. In Neural Computation, 1991.

[5] Heisele et al., Categorization by learning and combining object parts. In Proc. NIPS, 2002.

[6] Torralba et al., Sharing features: Efficient boosting procedures for multiclass object detection. In Proc. IEEE CVPR, 2004.

[7] Weber et al., Unsupervised learning of models for recognition. In Proc. ECCV, 2000.
