Scene Understanding in the Street Scenes Database
Stan Bileschi, Lior Wolf & Tomaso Poggio
The StreetScenes project is a multi-year endeavor by the Center for Biological and Computational Learning (CBCL) to promote and develop technologies for scene understanding. The first phase of this project was to acquire an appropriate training and testing platform, manifested in the StreetScenes database: a database of over 3,000 high-resolution images of outdoor scenes in Boston and the surrounding areas. These images have been hand labelled for around 10 object categories, including buildings, trees, cars, and pedestrians. Currently we are developing a unified approach to automatically learn reliable detectors for these objects. The goal is an architecture capable of learning a function from images to image descriptions directly from the training data. Some examples of StreetScenes images, their true labellings, and some empirical results are illustrated in Fig. 1. In the second row of Fig. 1, the fill colors indicate the hand labelling of the large texture-based objects within the scene: brown indicates buildings, green indicates trees, and blue and gray mark sky and road, respectively. The shape-based objects are labelled with bounding rectangles, orange for cars and yellow for pedestrians. White regions have no label. The same annotation is used for the empirical results of the scene understanding system in the last row, except that text, rather than color, distinguishes pedestrians from vehicles.
Our approach to scene understanding depends on the robust combination of a number of continuously developing technologies, including object detection [1,2], texture recognition [3,4], and contextual understanding [5,6]. Other key enabling technologies have been the development of the theory of statistical learning and intuitions drawn from rigorous studies of the biological visual system.
One difference between our system and related systems is that ours operates in a discriminative, feed-forward fashion and is capable of detecting large texture-based objects.
We divide the class of objects into two distinct sets, texture-based objects and shape-based objects, nearly analogous to Adelson's distinction between things and stuff. These two classes of objects are handled with different learning strategies, though both rely upon the same features. Fig. 2 illustrates a data-flow diagram for this architecture. The dashed and solid arrows indicate the pathways for detection of the texture-based and shape-based objects, respectively. Additionally, the arrow labelled ``context,'' connecting the two pathways, symbolizes that the system can use detections of the former class of objects to aid detections of the latter.
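The two-pathway flow of Fig. 2 can be caricatured in a few lines of Python. Everything below (the scalar feature, the threshold, the context bonus) is a hypothetical placeholder intended only to show the direction of the ``context'' arrow, not the actual trained detectors.

```python
from typing import Dict, Tuple

# Illustrative stubs only: in the real system, both pathways are classifiers
# trained over the shared SMF features. The feature dictionary, the 0.5
# threshold, and the 0.2 context bonus are invented for this sketch.

def texture_pathway(features: Dict[str, float]) -> str:
    """Per-region label from the texture-based pathway (dashed arrows in Fig. 2)."""
    return "road" if features["mean"] < 0.5 else "building"

def shape_pathway(features: Dict[str, float], context: Dict[str, bool]) -> float:
    """Score from the shape-based pathway (solid arrows), boosted by context:
    e.g. a detected road raises the score of a car detection above it."""
    score = features["mean"]
    if context.get("road", False):
        score += 0.2  # hypothetical context bonus
    return score

def scene_pipeline(features: Dict[str, float]) -> Tuple[str, float]:
    texture_label = texture_pathway(features)      # texture-based detections first...
    context = {texture_label: True}                # ...feed the "context" arrow...
    car_score = shape_pathway(features, context)   # ...into shape-based detection
    return texture_label, car_score

label, score = scene_pipeline({"mean": 0.3})
```

The point of the sketch is only the ordering: texture-based detections are computed first and then made available, via the context link, to the shape-based detectors.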
Standard Model Feature Set
The unifying principle of our scene understanding system is that all detectors rely upon visual features drawn from the same biologically inspired model of low-level visual processing, called the Standard Model Feature (SMF) set. The SMF set is computed by a multilayered image-processing architecture combining linear filtering, morphological dilation, and decimation. Previous detection systems have shown that the top-level SMF features perform well on the unsegmented object detection task. In the StreetScenes system we extend these experiments to the segmented object detection problem in cluttered natural scenes, and to object types not previously addressed. Encouraging results suggest that the SMF features outperform features engineered explicitly for these tasks.
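As a rough illustration of the filter-dilate-decimate sequence described above, the sketch below applies oriented Gabor filters (a common choice for the linear stage), a grayscale dilation (local maximum), and subsampling. The filter parameters, pooling size, and decimation step are assumptions for this sketch, not the parameters of the actual SMF model.

```python
import numpy as np
from scipy.ndimage import grey_dilation
from scipy.signal import convolve2d

def gabor(size, theta, wavelength=4.0, sigma=2.0, gamma=0.5):
    """Zero-mean oriented Gabor filter for the linear-filtering stage.
    All parameter values here are illustrative defaults."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    g = np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2)) \
        * np.cos(2 * np.pi * xr / wavelength)
    return g - g.mean()

def filter_dilate_decimate(image, n_orientations=4, pool=4, step=2):
    """One layer of the sketch: linear filtering, then a local max
    (grayscale dilation), then decimation by subsampling."""
    maps = []
    for i in range(n_orientations):
        theta = i * np.pi / n_orientations
        s1 = np.abs(convolve2d(image, gabor(7, theta), mode='same'))
        c1 = grey_dilation(s1, size=(pool, pool))[::step, ::step]  # max then decimate
        maps.append(c1)
    return np.stack(maps)

# toy example on a random "image"
feat = filter_dilate_decimate(np.random.rand(32, 32))
```

Stacking such layers, with larger pooling ranges at higher levels, yields features that are increasingly tolerant to position and scale while remaining selective for oriented structure.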
As illustrated in Fig. 1, our current system is capable of robust detection for 7 widely different object categories in natural scenes, with performance on par with state-of-the-art detectors designed explicitly for subsets of these objects. Specifically, the SMF-based detectors for cars, pedestrians, and bicycles outperform global SVM models and more recent part-based methods. The texture-based detectors for buildings, trees, roads, and sky outperform the texton-based detector of Malik, also designed for texture recognition. Finally, we have shown encouraging results involving contextual information flow between object types, such that the detection of a road may influence the detection of a car.
 Bernd Heisele, Thomas Serre, Sayan Mukherjee and Tomaso Poggio. Feature Reduction and Hierarchy of Classifiers for Fast Object Detection in Video Images. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), IEEE Computer Society Press, Kauai, Hawaii, Vol. 2, 18-24, December 2001.
 Chad Carson, Serge Belongie, Hayit Greenspan and Jitendra Malik. Blobworld: Image Segmentation Using Expectation-Maximization and Its Application to Image Querying. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8), 1026-1038, August 2002.
 Antonio Torralba, Kevin P. Murphy and William T. Freeman. Contextual Models for Object Detection using Boosted Random Fields. To appear in Advances in Neural Information Processing Systems (NIPS), 2004. (Also MIT AI Lab Memo AIM-2004-008, April 14, 2004.)
 Thomas Serre, Lior Wolf and Tomaso Poggio. A New Biologically Motivated Framework for Robust Object Recognition. CBCL Paper #243/AI Memo #2004-026, Massachusetts Institute of Technology, Cambridge, MA, November, 2004.