CSAIL Research Abstract

Learning Static Object Segmentation From Motion Segmentation

Michael G. Ross & Leslie Pack Kaelbling

Introduction

Image segmentation is the discovery of salient regions in static images and has a history reaching back to the Gestalt psychologists [1]. There are many computer vision approaches to this problem, but they are difficult to compare because there is no readily accessible ground truth. In recent years, this situation has improved as researchers such as Martin et al. [2] and Konishi et al. [3] have used human-segmented data to train and test boundary detection algorithms.

This work further grounds the image segmentation problem by replacing it with the better-defined goal of object segmentation and by using an automatically segmented database as the training and testing set. Object segmentation is the task of grouping pixels in a static image into regions that correspond to the objects in the underlying scene (similar to figure-ground segmentation). In this formulation, objects are sets of elements that move coherently in the world. If objects can be distinguished by their motion, a motion segmentation algorithm, which divides moving objects from their static surroundings, can provide a partially labeled database of segmentations. Such a database can be automatically gathered by a robot or other vision-processing agent by simply observing object motion in a new domain. The static image and shape properties associated with the boundaries and interiors of moving objects can be used to train an object segmentation model. Then the model can determine the segmentation of static images. Its performance can be measured by comparing the segmentation it produces on individual video frames to the motion segmentation calculated across the frames.

This abstract describes the Segmentation According to Natural Examples (SANE) object segmentation algorithm, which is trained and tested in this framework. Videos of moving objects are motion-segmented using background subtraction. This automatically labeled data provides a large, cheap training set that is used to learn the shape and image statistics of object segmentation. Then, presented with a new, static image, the learned model constructs a Markov random field model which can infer the underlying object segmentation using the belief propagation algorithm [4]. This algorithm outperforms a standard implementation of the general-purpose normalized cuts segmentation algorithm [5] on the object segmentation task and outperforms the trained Martin boundary detectors [2] on detecting the object boundaries.

[Figure: Getting segmentation labels from background subtraction]
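The labeling step described above can be illustrated with a deliberately simple stand-in. The sketch below uses a per-pixel median background and a fixed threshold, which is an assumption made for brevity and not the adaptive background-subtraction methods the abstract cites; the function names and the threshold value are hypothetical.

```python
from statistics import median

def median_background(frames):
    """Estimate a static background as the per-pixel median over all frames."""
    height, width = len(frames[0]), len(frames[0][0])
    return [[median(f[r][c] for f in frames) for c in range(width)]
            for r in range(height)]

def motion_mask(frame, background, threshold=30):
    """Label a pixel 1 (moving object) when it differs enough from the background."""
    return [[1 if abs(p - b) > threshold else 0
             for p, b in zip(frame_row, bg_row)]
            for frame_row, bg_row in zip(frame, background)]

# Toy video: one bright pixel moving left to right across a dark 3x5 scene.
frames = []
for t in range(5):
    frame = [[10] * 5 for _ in range(3)]
    frame[1][t] = 200
    frames.append(frame)

background = median_background(frames)
masks = [motion_mask(f, background) for f in frames]
```

Each mask then serves as a free, if partial, segmentation label for the corresponding static frame.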

Model and Algorithm

The SANE object segmentation model divides an image into a lattice of 5 pixel by 5 pixel, non-overlapping patches and assigns a variable to represent the segmentation at each patch. As long as all the objects present in a scene are separated from each other, only two pixel labels are needed to segment them. In our representation, the segmentation of each 5 by 5 patch is represented by the local boundary (non-existent or parameterized by entry, exit, and inflection points), and a one-bit parity that indicates which segmentation label is on which side of the boundary.
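As a rough illustration of the patch lattice, the sketch below cuts a binary label mask into non-overlapping 5 by 5 blocks and flags the blocks that contain a boundary. It is a simplification: the entry, exit, and inflection-point parameterization and the parity bit are not reproduced, and the helper names are hypothetical.

```python
PATCH = 5  # patch side length in pixels, as in the text

def patch_lattice(mask, patch=PATCH):
    """Split a 2-D label mask into non-overlapping patch x patch blocks.

    Returns a dict keyed by (patch_row, patch_col). Rows and columns that do
    not fill a whole patch are dropped in this sketch.
    """
    height, width = len(mask), len(mask[0])
    lattice = {}
    for pr in range(height // patch):
        for pc in range(width // patch):
            rows = mask[pr * patch:(pr + 1) * patch]
            lattice[(pr, pc)] = [row[pc * patch:(pc + 1) * patch] for row in rows]
    return lattice

def has_boundary(block):
    """A block contains a boundary when both segmentation labels occur in it."""
    return len({value for row in block for value in row}) > 1

# 10x10 mask: label 1 from column 7 onward, label 0 elsewhere.
mask = [[1 if c >= 7 else 0 for c in range(10)] for _ in range(10)]
lattice = patch_lattice(mask)
boundary_patches = sorted(key for key, block in lattice.items() if has_boundary(block))
```

In this toy mask only the two right-hand patches straddle the label change, so only they would need a non-empty boundary description.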

The variables are linked into a Markov random field (MRF), an undirected probabilistic graphical model. Each node is linked to its up, down, left, and right neighbors. Every individual node has a compatibility function that indicates the desirability of different segmentation assignments given the local image information, and every neighboring pair of nodes has a compatibility function that measures how well their assignments agree.
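The lattice structure can be made concrete with a small sketch. It enumerates the edges of a 4-connected grid and multiplies node and pairwise compatibilities into an unnormalized score for one joint assignment. The two-state variables and the numeric compatibility values are made up for illustration; SANE's actual state space is the set of patch boundary descriptions.

```python
def grid_edges(rows, cols):
    """List each undirected edge of a 4-connected grid exactly once."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                edges.append(((r, c), (r, c + 1)))  # right neighbor
            if r + 1 < rows:
                edges.append(((r, c), (r + 1, c)))  # down neighbor
    return edges

def unnormalized_score(assignment, node_compat, pair_compat, edges):
    """Product of all node and pairwise compatibilities for one assignment."""
    score = 1.0
    for node, state in assignment.items():
        score *= node_compat[node][state]
    for a, b in edges:
        score *= pair_compat[assignment[a]][assignment[b]]
    return score

rows, cols = 2, 2
edges = grid_edges(rows, cols)
nodes = [(r, c) for r in range(rows) for c in range(cols)]
node_compat = {n: [1.0, 2.0] for n in nodes}   # illustrative: state 1 preferred
pair_compat = [[2.0, 1.0], [1.0, 2.0]]         # illustrative: agreement preferred

all_ones = {n: 1 for n in nodes}
one_flipped = dict(all_ones)
one_flipped[(0, 0)] = 0

score_agree = unnormalized_score(all_ones, node_compat, pair_compat, edges)
score_mixed = unnormalized_score(one_flipped, node_compat, pair_compat, edges)
```

The assignment in which all neighbors agree scores higher than the one with a flipped corner, which is the behavior the pairwise compatibilities are meant to encode.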

Unlike Bayesian networks, the compatibility functions for MRFs do not correspond to easily measured probability functions. Instead, we set them using an approximate method developed by Wainwright et al. [6] which relates them to the observed marginal probability functions on each node and neighboring pair. These probabilities can be learned from the training data, which consist of videos of moving objects. The moving objects can be segmented from their surroundings in each video frame by background-subtraction algorithms [7,8], providing us with labeled training data. By using this data to learn the MRF compatibility functions, the model is trained to recognize static image boundaries that correspond well to the motion boundaries observed in the video.
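One common reading of pseudo-moment matching, in the special case of ordinary (non-reweighted) belief propagation, sets each node compatibility to the empirical node marginal and each pairwise compatibility to the empirical joint marginal divided by the product of the two node marginals. The sketch below follows that reading. Whether it matches SANE's exact procedure is an assumption, and it ignores the dependence on image features; the function names are hypothetical.

```python
def empirical_marginals(samples, edges, n_states):
    """Count node and edge state frequencies over labeled training samples.

    samples: list of dicts mapping node -> observed state.
    edges: list of (node_a, node_b) pairs.
    """
    nodes = list(samples[0])
    n = len(samples)
    node_counts = {v: [0] * n_states for v in nodes}
    edge_counts = {e: [[0] * n_states for _ in range(n_states)] for e in edges}
    for labeling in samples:
        for v in nodes:
            node_counts[v][labeling[v]] += 1
        for a, b in edges:
            edge_counts[(a, b)][labeling[a]][labeling[b]] += 1
    node_marg = {v: [k / n for k in counts] for v, counts in node_counts.items()}
    edge_marg = {e: [[k / n for k in row] for row in table]
                 for e, table in edge_counts.items()}
    return node_marg, edge_marg

def pseudo_moment_compat(node_marg, edge_marg):
    """Node compat = node marginal; edge compat = joint / (marginal * marginal)."""
    node_compat = {v: list(m) for v, m in node_marg.items()}
    edge_compat = {}
    for (a, b), joint in edge_marg.items():
        table = []
        for x, row in enumerate(joint):
            out_row = []
            for y, p_xy in enumerate(row):
                denom = node_marg[a][x] * node_marg[b][y]
                # Unobserved states get compatibility 0 in this sketch.
                out_row.append(p_xy / denom if denom > 0 else 0.0)
            table.append(out_row)
        edge_compat[(a, b)] = table
    return node_compat, edge_compat

# Four toy training labelings of two neighboring nodes.
samples = [{"a": 0, "b": 0}, {"a": 0, "b": 0}, {"a": 1, "b": 1}, {"a": 0, "b": 1}]
edges = [("a", "b")]
node_marg, edge_marg = empirical_marginals(samples, edges, n_states=2)
node_compat, edge_compat = pseudo_moment_compat(node_marg, edge_marg)
```

A pairwise value above 1 marks a state pair that co-occurs more often than independence would predict, and a value below 1 marks a pair that co-occurs less often.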

Given a new image, a segmentation MRF is constructed using the learned compatibilities and the segmentation is inferred using the loopy belief propagation algorithm [4,9], an algorithm that has become increasingly popular in computer vision [10]. In our MRF, it efficiently computes an approximately optimal assignment to the segmentation variables.
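To make the inference step concrete, here is a minimal sketch of loopy belief propagation on a small 4-connected grid. It uses the sum-product variant with synchronous updates, two states per node, and a single shared pairwise table; all of these are assumptions chosen for brevity, and the compatibility values are invented.

```python
def neighbors(node, rows, cols):
    """Up, down, left, and right neighbors that lie inside the grid."""
    r, c = node
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < rows and 0 <= j < cols]

def loopy_bp(node_compat, pair_compat, rows, cols, n_iters=20):
    """Synchronous sum-product BP; returns the highest-belief state per node."""
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    n_states = len(pair_compat)
    # messages[(src, dst)][t]: what src reports about dst being in state t.
    messages = {(u, v): [1.0 / n_states] * n_states
                for u in nodes for v in neighbors(u, rows, cols)}
    for _ in range(n_iters):
        updated = {}
        for (u, v) in messages:
            out = []
            for t in range(n_states):
                total = 0.0
                for s in range(n_states):
                    incoming = 1.0
                    for w in neighbors(u, rows, cols):
                        if w != v:  # exclude the recipient's own message
                            incoming *= messages[(w, u)][s]
                    total += node_compat[u][s] * pair_compat[s][t] * incoming
                out.append(total)
            norm = sum(out)
            updated[(u, v)] = [m / norm for m in out]  # normalize for stability
        messages = updated
    labels = {}
    for u in nodes:
        belief = list(node_compat[u])
        for w in neighbors(u, rows, cols):
            belief = [b * m for b, m in zip(belief, messages[(w, u)])]
        labels[u] = max(range(n_states), key=lambda s: belief[s])
    return labels

rows, cols = 3, 3
# Local evidence: columns 0 and 1 favor state 0, column 2 favors state 1.
node_compat = {(r, c): ([0.1, 0.9] if c == 2 else [0.9, 0.1])
               for r in range(rows) for c in range(cols)}
node_compat[(1, 1)] = [0.4, 0.6]           # noisy local evidence at the center
pair_compat = [[0.8, 0.2], [0.2, 0.8]]     # neighbors prefer to agree
labels = loopy_bp(node_compat, pair_compat, rows, cols)
```

In this toy grid the center node's weak local preference for state 1 is overruled by its neighbors, which shows how the pairwise terms suppress isolated, unsupported labels.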

In some cases, multiresolution models, which feature MRFs trained on both half-resolution and full-resolution training data, improve our results by better capturing long-distance dependencies.

Results

The SANE algorithm outperforms both the standard normalized cuts algorithm [5] and the Martin et al. [2] learned edge detectors at finding object boundaries in our data. In the figure, an example from the Traffic data demonstrates that SANE excludes several non-boundary edges that the Martin detector detects, showing the benefit of its learned shape model. Over all the test data, the best Martin detector has a lower precision than SANE because of this tendency to detect boundaries that do not correspond to the known motion boundaries.
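The precision comparison above can be illustrated with a small sketch that scores predicted boundary pixels against the motion-segmentation boundary. Exact pixel matching is an assumption made for simplicity (boundary benchmarks usually tolerate small localization errors), and the two toy detectors are invented for illustration, not measurements from the paper.

```python
def precision_recall(predicted, reference):
    """Score boundary pixels by exact coordinate match.

    predicted, reference: sets of (row, col) boundary pixel coordinates.
    """
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

# Reference: a vertical motion boundary in column 3 of a 4-row image.
reference = {(r, 3) for r in range(4)}

# Toy detector A finds the whole boundary plus four spurious texture edges.
detector_a = reference | {(r, 0) for r in range(4)}
# Toy detector B finds three of the four boundary pixels and nothing else.
detector_b = {(r, 3) for r in range(3)}

precision_a, recall_a = precision_recall(detector_a, reference)
precision_b, recall_b = precision_recall(detector_b, reference)
```

Detector A mirrors the failure mode described above: it recovers every true boundary pixel but loses precision to edges that are not object boundaries.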

In the future, we plan to expand SANE's shape model to include T-junctions in order to handle the segmentation of overlapping objects and to explore enhancements to the image features it uses.

Support

This work was funded in part by the Defense Advanced Research Projects Agency (DARPA), through the Department of the Interior, NBC, Acquisition Services Division, under Contract No. NBCHD030010, and in part by the Singapore-MIT Alliance agreement dated 11/6/98.

References:

[1] S. Palmer. Vision Science: Photons to Phenomenology. The MIT Press. 1999.

[2] D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(5). 2004.

[3] S. Konishi, A. Yuille, J. Coughlan, and S.C. Zhu. Statistical edge detection: Learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(1). 2003.

[4] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann. 1988.

[5] J. Shi and J. Malik. Normalized cuts and image segmentation. In Computer Vision and Pattern Recognition. 1997.

[6] M. Wainwright, T. Jaakkola, and A. Willsky. Tree-reweighted belief propagation and approximate ML estimation by pseudo-moment matching. In Workshop on Artificial Intelligence and Statistics. 2003.

[7] C. Stauffer and W.E.L. Grimson. Adaptive background mixture models for real-time tracking. In Computer Vision and Pattern Recognition. 1999.

[8] J. Migdal and W.E.L. Grimson. Background subtraction using Markov thresholds. In IEEE Workshop on Motion and Video Computing. 2005.

[9] Y. Weiss. Belief propagation and revision in networks with loops. Technical Report 1616, MIT Artificial Intelligence Laboratory. 1997.

[10] W. Freeman, E. Pasztor, and O. Carmichael. Learning low-level vision. International Journal of Computer Vision 40(1). 2000.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu
(Note: On July 1, 2003, the AI Lab and LCS merged to form CSAIL.)