CSAIL Publications and Digital Archive

Research Abstracts - 2006
Region-Based Image Modeling for Tracking

Gerald Dalley & Josh Migdal


Many tasks in computer vision require segmenting moving objects from the background and/or each other as a first step. For example, a system for recording and automatically annotating business meetings is greatly assisted by being able to track the shape of participants so events like talking or hand raising can be identified. In an outdoor security system, knowing the locations of all visible people or vehicles allows them to be tracked. This tracking can then be used to provide better compression (e.g. preserve detail on moving objects) and/or analyze activities. Many techniques have been developed over the years to perform this segmentation of moving objects. We are developing several methods to improve upon existing approaches and address key limitations.

Prior Work

Arguably the simplest approach to motion segmentation is frame differencing: given two images from successive video frames, IA(x,y) and IB(x,y), we label the pixel at (x,y) as foreground if |IA(x,y)-IB(x,y)| > t for some threshold t. Foreground pixels can then be grouped and tracked between frames using a variety of techniques. A well-known limitation of frame differencing is that only edges and textured regions are marked as foreground; it is also not very robust to image noise. In Figure 1c, we see that the segmentation is very noisy: it misses most of the people while picking up sensor noise and tree motion.
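The thresholding rule above can be sketched in a few lines of NumPy. The image contents and the threshold t are illustrative; a real system would also group the resulting foreground pixels into connected components for tracking.

```python
import numpy as np

def frame_difference(frame_a, frame_b, t=25):
    """Label pixel (x, y) as foreground if |IA(x,y) - IB(x,y)| > t."""
    # Cast to a signed type so the subtraction cannot wrap around.
    diff = np.abs(frame_a.astype(np.int16) - frame_b.astype(np.int16))
    return diff > t  # boolean foreground mask

# Toy example: a flat background with one "moving" bright patch.
a = np.zeros((4, 4), dtype=np.uint8)
b = a.copy()
b[1:3, 1:3] = 200
mask = frame_difference(a, b)
```

Only the four pixels of the patch exceed the threshold, illustrating why untextured object interiors are missed: wherever the two frames agree, nothing is marked.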

[Figure 1 image: four panels -- (a) Input Video, (b) Ideal Results, (c) Frame Differencing, (d) Stauffer and Grimson]

Figure 1: Motion segmentation examples. (a) is a single frame from a video sequence with five people walking through and trees waving in the wind. On the right of the image is a person holding some flowers, on the far side of the street is a group of three people, and at the bottom of the image is a person entering a car. (b) shows a human-made segmentation of the five people; this is a sketch of idealized results. (c) shows the results of performing 2-frame background subtraction. (d) shows the foreground pixels when using a background model based on Stauffer and Grimson's [1].

A typical approach in current vision systems is to build an adaptive statistical model of the background image. When a new frame is presented, pixels that are unlikely to have been generated by this model are labeled as foreground. Stauffer and Grimson [1] represent the background as a mixture of Gaussians. At each pixel, a collection of Gaussians emit values in RGB (red, green, blue) space. When a pixel value is observed in a new frame, it is matched to the Gaussian most likely to emit it. That Gaussian is then updated with the pixel value using an exponential forgetting scheme. This allows online adaptation to changing imaging conditions such as changes in lighting or objects that stop moving. Pixel values are labeled as foreground when they are associated with uncommon Gaussians or when they do not match any Gaussian well. This approach lends itself to realtime implementation and works well when the camera does not move and neither does the "background." However, for most applications we want objects such as branches and leaves waving in the wind, white wave crests on water, and water fountains to be considered as background even though they involve motion. Because these dynamic textures cause large changes at the individual pixel level, they typically fail to be modeled well under a fully independent pixel model. In Figure 1d, we see that the foreground mask from Stauffer and Grimson's method not only (correctly) includes all of the people, but also includes many extra pixels due to image noise and moving trees.
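A simplified, single-pixel, grayscale sketch of this kind of adaptive mixture model is below. The learning rate, the 2.5-standard-deviation match test, and the background weight fraction follow the usual choices in [1]; the scalar (non-RGB) values, the initial variance, and the weight-update details are simplifications for illustration, not the paper's exact algorithm.

```python
import numpy as np

class PixelMixture:
    """One pixel's adaptive mixture of K Gaussians (grayscale sketch)."""

    def __init__(self, k=3, alpha=0.05, t_bg=0.7):
        self.means = np.zeros(k)
        self.vars = np.full(k, 900.0)       # broad initial variance
        self.weights = np.full(k, 1.0 / k)
        self.alpha = alpha                  # exponential forgetting rate
        self.t_bg = t_bg                    # background weight fraction

    def observe(self, value):
        """Update the mixture with one observation; True means foreground."""
        d = np.abs(value - self.means)
        matches = d < 2.5 * np.sqrt(self.vars)
        if matches.any():
            cand = np.flatnonzero(matches)
            j = cand[np.argmax(self.weights[cand])]   # best-matching Gaussian
            # Exponential forgetting: blend the Gaussian toward the new value.
            self.means[j] += self.alpha * (value - self.means[j])
            self.vars[j] += self.alpha * ((value - self.means[j]) ** 2
                                          - self.vars[j])
        else:
            j = int(np.argmin(self.weights))          # replace weakest Gaussian
            self.means[j], self.vars[j] = value, 900.0
        # Decay all weights; boost the matched/replaced component.
        boost = np.zeros_like(self.weights)
        boost[j] = 1.0
        self.weights = (1 - self.alpha) * self.weights + self.alpha * boost
        self.weights /= self.weights.sum()
        # Background = highest-ranked Gaussians covering t_bg of the weight.
        order = np.argsort(-self.weights / np.sqrt(self.vars))
        cum = np.cumsum(self.weights[order])
        bg = order[: int(np.searchsorted(cum, self.t_bg)) + 1]
        return j not in bg                            # uncommon Gaussian => fg
```

After the model has seen a stable value long enough for a well-weighted, low-variance Gaussian to form, that value is labeled background, while a sudden outlier spawns a low-weight Gaussian and is labeled foreground. Note how a single noisy pixel value that drifts outside 2.5 standard deviations is immediately flagged, which is exactly the dynamic-texture weakness discussed above.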

More recently, Mittal and Paragios [2] have used the most recent N frames to build a non-parametric model of RGB and optical flow, taking care to handle measurement uncertainty. While their approach is still pixel-based, they produce impressive results when the same motions are observed multiple times in the given time window. Challenges arise when motions recur only infrequently, such as trees rustling periodically in wind gusts. The choice then comes down to misclassifying the affected pixels as foreground or increasing the time window (at a linear performance cost). For a 200-frame window, their approach is at least an order of magnitude slower than [1].
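A toy version of the non-parametric idea, reduced to grayscale kernel density estimation over a sliding window, is sketched below. The bandwidth h, window length, and density threshold are all illustrative assumptions; the actual method in [2] also models optical flow and measurement uncertainty, which this sketch omits.

```python
import numpy as np

def kde_foreground(history, value, h=10.0, thresh=1e-3):
    """Gaussian KDE over the last N samples at one pixel; True => foreground."""
    hist = np.asarray(history, dtype=float)
    # Average of Gaussian kernels centered at each recent observation.
    dens = np.exp(-((value - hist) ** 2) / (2 * h * h)).mean()
    dens /= np.sqrt(2 * np.pi) * h
    return dens < thresh

# N = 200 recent frames of a pixel that hovers around intensity 100.
history = [100, 102, 99, 101, 100] * 40
```

The window-length tradeoff from the text is visible here: each query touches all N stored samples, so doubling N to capture rarer motions doubles the per-pixel cost.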

Jojic and Frey [3] have taken a radically different approach, extending a model proposed by Wang and Adelson [4]. They consider an image to be generated by a collection of layers, where near layers occlude far ones. Their model assumes that the number of layers and their depth ordering are known and fixed. Each layer is free to translate across the image. Extensions exist such as Winn and Blake's [5] affine motion model. Because finding the optimal solution is intractable, they employ variational approximations to their model. Unlike the other methods mentioned, their approach is batch-mode, so it cannot be used as-is on continuous video feeds.

Our Model(s)

We are developing several background and image models that address shortcomings in the aforementioned ones. Our current models treat image generation as a "spray-painting" process depicted in Figure 2. Our images are painted by a set of spray paint cans (indexed by j) parameterized by their x,y nozzle location μj, Gaussian spatial extent Σj, and strength of the spray wj. The amount of paint landing on pixel i from can j is given deterministically by pij = N(li; μj, Σj) × wj, where li is the x,y position of pixel i and N(...) is the normal distribution. We model the spray as a series of completely opaque pixel-sized flecks of paint that land on the canvas. On a particular pixel, the flecks of paint from different cans land in random order, with the last fleck determining the color. Each fleck's color is drawn from its can's color distribution. In Figure 2, K is the number of paint cans, zi is the identity of the winning can for pixel i, and ci is the color of pixel i (drawn from the color distribution identified by zi).
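The deterministic paint amounts and the per-pixel winner can be sketched as follows. All parameter values (image size, can locations, extents, strengths) are made up for illustration, and the spatial extents are simplified to isotropic Gaussians; since flecks land in uniformly random order, the last (color-determining) fleck at pixel i comes from can j with probability proportional to pij, which is how zi is sampled here.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, K = 8, 8, 3
mu = np.array([[2.0, 2.0], [5.0, 5.0], [4.0, 1.0]])  # nozzle locations mu_j
sigma = np.array([1.0, 1.5, 0.8])                    # isotropic extents
w = np.array([1.0, 2.0, 0.5])                        # spray strengths w_j

# l_i: the (x, y) position of each pixel, flattened to an (H*W, 2) array.
ys, xs = np.mgrid[0:H, 0:W]
l = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(float)

# p[i, j] = N(l_i; mu_j, sigma_j^2 I) * w_j  (deterministic paint amount)
d2 = ((l[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
p = w * np.exp(-d2 / (2 * sigma**2)) / (2 * np.pi * sigma**2)

# z_i: identity of the winning can, i.e. the can whose opaque fleck lands
# last, sampled in proportion to the paint each can deposits on pixel i.
probs = p / p.sum(axis=1, keepdims=True)
z = np.array([rng.choice(K, p=probs[i]) for i in range(H * W)]).reshape(H, W)
```

Pixels near a can's nozzle are almost always won by that can, while pixels between cans are won stochastically, which is where the zi map carries real uncertainty.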

Relative to a pure pixel-process model such as [1,2], our model assumes that there is spatial extent to the objects in the world that generate pixel values. To the extent that the objects are composed of large flat-shaded regions, our model produces a much sparser representation. This is because a single Gaussian mixture component (a spray can) can explain data for many pixels instead of each pixel needing its own copy of the component. Additionally, by supplying a non-deterministic distribution over μj, we can model uncertainty in the spatial location of the cans. This is useful for modeling situations such as leaves blowing in a direction parallel to the image plane. By supplying a distribution over Σj, we can model changes in size due to movement perpendicular to the image plane. Areas of exploration include the interplay of the uncertainty in size and the uncertainty in position of Gaussians and methods for efficiently and robustly estimating the zi map from data.

Our model also offers advantages over existing layered models [3,4,5]. They typically assume that only a small, fixed number of layers are present in the image, where each layer represents a single independently-moving object or part of an object. Our model instead tracks individual patches of color. This allows us to handle an arbitrary number of objects. If we allow new paint cans to be generated when the data fits our model poorly, our method may be used in online scenarios where the number of objects in the scene is constantly changing. We can also produce layers by grouping together cans that produce coherent motions consistent with a single object model. Because we do not assume some fixed layer ordering, we can also handle situations such as two people walking around each other in a circle where at different points in time, the depth ordering changes.

Versions of our model that simplify the zi estimation are implementable as realtime or quasi-realtime algorithms. We expect our fuller model given in Figure 2 to be at least performance-competitive with full layered models while giving us added flexibility and better performance for busy scenes.


This research is funded by the Defense Advanced Research Projects Agency (DARPA).


[1] Chris Stauffer and W.E.L. Grimson. Adaptive Background Mixture Models for Real-time Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 246--252, 1999.

[2] Anurag Mittal and Nikos Paragios. Motion-based Background Subtraction Using Adaptive Kernel Density Estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2004.

[3] Nebojsa Jojic and Brendan Frey. Learning Flexible Sprites in Video Layers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2001.

[4] John Wang and Edward Adelson. Representing Moving Images with Layers. IEEE Transactions on Image Processing, vol. 3, no. 5, pp. 625--638, September 1994.

[5] John Winn and Andrew Blake. Generative Affine Localization and Tracking. In Advances in Neural Information Processing Systems, vol. 17, pp. 1505--1512, 2004.


Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu