Learning a Dictionary of Shape-Components in Visual Cortex

Thomas Serre & Tomaso Poggio

The Problem

The tuning properties of neurons in the ventral stream of visual cortex, from primary visual cortex (V1) to inferotemporal cortex (IT), are likely to play a key role in visual perception in primates, and in particular in their object recognition abilities. The tuning of specific neurons probably depends, at least in part, on visual experience. We describe a model of visual plasticity and learning from V4 to IT, extending the initial version of the standard model of object recognition in primate cortex, that accounts for known physiological data and invariance properties of neurons in IT. When exposed to many natural images, the model generates a large set of shape-tuned units which can be interpreted as a universal (redundant) dictionary of shape-components with the properties of over-completeness and non-uniqueness. When tested on real-world natural images, the model outperforms the best computer vision systems on several different recognition tasks. We also show that the tuning properties of the set of shape-tuned units generated by the model are compatible with those of cortical neurons.

Background

The standard model aims at summarizing a set of basic facts about cortical mechanisms of recognition established over the last decade by several physiological studies in non-human primates. The model is a quantitative extension of the classical feed-forward, hierarchical architecture originally proposed by Hubel & Wiesel to build complex cells from simple cells in primary visual cortex (Hubel and Wiesel, 1962, 1965). Object recognition in cortex is thought to be mediated by the ventral visual pathway, running from primary visual cortex, V1, over extrastriate visual areas V2 and V4 to inferotemporal cortex, IT. Starting from simple cells in V1, which have small receptive fields and respond preferentially to oriented bars, neurons along the ventral stream show a gradual increase in receptive field size as well as in the complexity of their preferred stimuli. At the top of the ventral stream, in anterior inferotemporal cortex (AIT), cells are tuned to complex stimuli such as faces and other objects. The tuning of the view-tuned and object-tuned cells in AIT very likely depends on visual experience (Jagadeesh et al., 2001; Kobatake et al., 1998; Logothetis et al., 1995; Sakai et al., 1994). More recent studies also suggest adult plasticity below IT, in V4 (Rainer et al., 2004; Yang and Maunsell, 2004) and even in V1 (Schoups et al., 2004). In brief, the accumulated evidence points to a visual system in which a) objects can be identified and categorized within a time too short to allow for recursion between areas, b) plasticity is present at the stage of IT and probably also at earlier stages, c) the tuning of IT cells is obtained through a hierarchy of cortical stages involving cells tuned to simpler features, d) the basic ability to generalize depends on the combination of cells tuned by visual experience, and e) a specific pooling operation provides invariance to image-based transformations.

Methods

The standard model relies on two key computational mechanisms: tuning by the simple S units to build object selectivity, and softmax pooling by the complex C units to gain invariance to object transformations.
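As a rough illustration of these two operations, the sketch below (Python/NumPy) shows a Gaussian-like tuning function for an S unit and a softmax pooling function for a C unit. The prototype vector, the Gaussian width sigma, and the softmax sharpness beta are illustrative assumptions, not parameters taken from the model.

```python
import numpy as np

def s_unit_response(inputs, prototype, sigma=1.0):
    """Gaussian-like tuning of a simple S unit: the response peaks when the
    vector of afferent activities matches the unit's stored prototype."""
    d2 = np.sum((inputs - prototype) ** 2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def c_unit_response(afferents, beta=8.0):
    """Softmax pooling of a complex C unit over afferent S units tuned to the
    same preferred stimulus at nearby positions and scales; as beta grows the
    response approaches the maximum afferent activity, giving tolerance to
    translation and scale."""
    afferents = np.asarray(afferents, dtype=float)
    w = np.exp(beta * afferents)
    return float(np.sum(afferents * w) / np.sum(w))

# Toy example: an S unit tuned to a small patch of afferent activity,
# probed with noisy versions of its preferred stimulus at 9 nearby positions.
rng = np.random.default_rng(0)
prototype = rng.random(16)
responses = [s_unit_response(prototype + 0.1 * rng.standard_normal(16), prototype)
             for _ in range(9)]
print(c_unit_response(responses))  # pooled, position-tolerant response
```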
In the model the two types of layers alternate: simple S units take their inputs from neighboring units tuned to different preferred stimuli and combine them with a Gaussian-like tuning function, thereby increasing object selectivity, while complex C units pool over units tuned to the same preferred stimulus but at slightly different positions and scales through a softmax operation, thereby introducing tolerance to scale and translation. This gradual increase in both selectivity and scale is critical to avoid both 1) a combinatorial explosion in the number of units and 2) the binding problem.

To account for the effect of visual experience on the tuning of neurons in areas V4 and PIT, instead of the hardwired units of the original model we assume that simple (S2) and complex (C2) cells in V4 become tuned to patches of C1 unit activity (C1 units correspond to complex cells in V1 and V2) that repeat across different images of the same objects. We have already simulated a simplified version of Foldiak's trace rule (Foldiak, 1991) to generate S2 and C2 cells that become tuned to complex features of images. After presentation of many natural images, the units become tuned to complex features, for instance to face components if a sequence of face images (in the presence of background) is presented, even when the objects do not appear at the same position and scale. Learning is task-independent and relies on nothing more than temporal continuity (e.g., the same object being present throughout a temporal sequence of images). The same process is iterated in PIT (S3 and C3 cells), where the neurons become tuned to patches of activity in V4. We also assume, consistent with available data, that there are direct projections from V2 (roughly corresponding to the S1 and C1 cells of the model) to PIT, generating S2b units with higher selectivity (they are tuned to larger patches, with a larger number of subunits, than the S2 units in V4).

Results

We ran extensive model simulations on datasets of natural images and compared the results with those of state-of-the-art computer vision systems. Fig. 2 contains a summary table of the performance of the model on various datasets. For both our system and the benchmarks, we report the error rate at the equilibrium point, i.e., the error rate at which the false positive rate equals the miss rate. The performance of the model is consistently higher than that previously reported on the Caltech datasets (available at www.vision.caltech.edu). Our system also seems to outperform the component-based system by (Heisele et al., 2002) on the MIT-CBCL face database, as well as a fragment-based system by (Leung, 2004) that uses template-based features with gentle AdaBoost, similar to the system by (Torralba et al., 2004). We also evaluated the performance of our system on a 101-object dataset for comparison with (Fei-Fei et al., 2004), also available at www.vision.caltech.edu. Fig. 2 contains sample results from our system on the 101-object database (trained with only 30 positive training examples). The reader can refer to (Serre et al., 2005) for a complete set of results.

Finally, we conducted initial experiments on the multi-class case. For this task we used the 101-object dataset. We split each category into a training set of size 30 and a test set containing the rest of the images. We used a simple multi-class linear SVM as the classifier. The SVM applied the all-pairs method for multi-class classification and was trained on 102 labels (the 101 categories plus the background category, i.e., a 102-alternative forced choice).
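A minimal sketch of such a classification stage is given below, assuming scikit-learn and random stand-in feature vectors in place of the model's actual C2/C3 activations; scikit-learn's SVC trains one-vs-one ("all-pairs") binary classifiers for multi-class problems, which matches the scheme described above. The feature dimensionality and data are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_classes, n_train_per_class, n_features = 102, 30, 1000  # 101 categories + background

# Hypothetical stand-ins for the shape-component (C2/C3-like) activations.
X_train = rng.random((n_classes * n_train_per_class, n_features))
y_train = np.repeat(np.arange(n_classes), n_train_per_class)

# Linear SVM; multi-class handling is one-vs-one over the 102 labels.
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

X_test = rng.random((10, n_features))
print(clf.predict(X_test))  # predicted category labels for the test images
```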
We obtained a 42% correct classification rate when using 30 training examples per class, averaged over 10 repetitions (chance is below 1%). We have also shown that the tuning properties of the shape-tuned units generated in the intermediate layers of the model, after passive exposure to many random natural images, are compatible with those of cortical neurons in area V4; see Fig. 3 and the abstract by (Cadieu et al., 2005).

Remarks and Predictions
References

[1] Serre et al., Object recognition with features inspired by visual cortex. To appear in Proc. IEEE CVPR, 2005.
[2] Fei-Fei et al., Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In Proc. IEEE CVPR, Workshop on Generative-Model Based Vision, 2004.
[3] Fergus et al., Object class recognition by unsupervised scale-invariant learning. In Proc. IEEE CVPR, 2003.
[4] Foldiak, P., Learning invariance from transformation sequences. Neural Computation, 1991.
[5] Heisele et al., Categorization by learning and combining object parts. In Advances in Neural Information Processing Systems (NIPS 2001), 2002.
[6] Torralba et al., Sharing features: Efficient boosting procedures for multiclass object detection. In Proc. IEEE CVPR, 2004.
[7] Weber et al., Unsupervised learning of models for recognition. In ECCV, 2000.