Research Abstracts - 2007

Region Tracking Using Dynamic Models of Texture & Motion

Joshua Migdal, John Fisher & Eric Grimson


The tracking of moving objects within a scene is a deceptively simple problem. After all, if an object is in the scene at one time, it cannot have moved very far in the time between frames, which is typically on the order of tens of milliseconds, nor could it have changed significantly in appearance. However, a number of factors conspire to confound current tracking methods. In blob-based trackers, for example, a common failure case occurs when an object passes in front of a portion of background whose appearance is similar to its own: the blob loses cohesion and breaks into multiple pieces. Once this occurs, it is difficult to maintain that object's track, as the object is now modeled as a number of independent blobs. On the other hand, a common difficulty facing template-based trackers, which require the appearance of the tracked objects to be known, is that the appearance and shape of an object rarely stay the same throughout the image sequence and may, in fact, change in complex ways. Lastly, a problem facing both tracking paradigms is occlusion. Objects can pass in front of other objects, and they can also pass in front of or behind stationary structures in the scene. When an object is occluded, we lose the ability to detect it from the imagery, as it is no longer visible.

Cognizant of the challenges described above, we seek to track regions distinguishable in terms of their joint texture and motion properties. Furthermore, we seek to model such regions using a joint dynamic on appearance and shape that allows for the complex and nonstationary deformations that a region can undergo.


Our model consists of a linear dynamic on parameters of both shape and appearance that are used to predict the location, pose, and pixelwise appearance of a region of an image at some time t. The shape parameters used are the vertices of a tessellated mesh computed over a given region R and the appearance parameters are the tracked color values.
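As a rough illustration of the forward prediction step, the sketch below applies a linear dynamic to a region's shape and appearance parameters. The `RegionState` class, the per-vertex velocities, and the constant-velocity form of the dynamic are assumptions made for this example; the text above states only that the dynamic is linear.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec2 = Tuple[float, float]


@dataclass
class RegionState:
    """Hypothetical state of one tracked region (names are illustrative)."""
    vertices: List[Vec2]    # mesh vertex positions (shape parameters)
    velocities: List[Vec2]  # per-vertex velocities (assumed, not from the text)
    colors: List[float]     # tracked appearance values, one grey level per sample


def predict(state: RegionState, dt: float = 1.0) -> RegionState:
    """Forward predict one time step with a constant-velocity linear dynamic."""
    new_vertices = [
        (x + dt * vx, y + dt * vy)
        for (x, y), (vx, vy) in zip(state.vertices, state.velocities)
    ]
    # Assumption: appearance is predicted to persist unchanged (identity dynamic).
    return RegionState(new_vertices, list(state.velocities), list(state.colors))


if __name__ == "__main__":
    state = RegionState(
        vertices=[(0.0, 0.0), (1.0, 0.0)],
        velocities=[(1.0, 0.5), (1.0, 0.5)],
        colors=[0.2, 0.4],
    )
    print(predict(state).vertices)
```

Any other linear transition (for example, one that also damps the velocities) would fit the same interface; constant velocity is simply the shortest choice to write down.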

Given input regions R_1, ..., R_N and models M_1, ..., M_N, the goal is to track their location, shape, and appearance through some extent of time. Since the models are fully generative, probabilistic models, we can use them to estimate occlusion masks and, where tracked regions overlap, to determine which are currently visible and where.
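One way to picture how generative models can yield occlusion masks is to let overlapping regions compete, pixel by pixel, to explain the observed value. The sketch below assumes each model predicts a Gaussian (mean, variance) per pixel and assigns visibility to the best-scoring region. The function names, the Gaussian form, and the winner-take-all rule are illustrative assumptions, not details given in the text.

```python
import math
from typing import Dict, List, Tuple

Prediction = Tuple[float, float]  # (mean, variance) predicted for one pixel


def gaussian_log_likelihood(value: float, mean: float, var: float) -> float:
    """Log density of `value` under a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (value - mean) ** 2 / var)


def occlusion_masks(
    observed: List[float],
    predictions: List[Dict[str, Prediction]],
) -> Dict[str, List[bool]]:
    """Return, per region, a mask that is True where the region is estimated visible.

    predictions[i] maps the name of each region covering pixel i to its
    predicted (mean, variance) at that pixel.
    """
    names = sorted({name for preds in predictions for name in preds})
    masks: Dict[str, List[bool]] = {name: [] for name in names}
    for value, preds in zip(observed, predictions):
        winner = None
        if preds:
            # The region whose prediction best explains the pixel is taken as visible.
            winner = max(
                preds,
                key=lambda n: gaussian_log_likelihood(value, *preds[n]),
            )
        for name in names:
            masks[name].append(name == winner)
    return masks


if __name__ == "__main__":
    observed = [0.9, 0.1, 0.5]
    predictions = [
        {"car_a": (0.9, 0.01), "car_b": (0.1, 0.01)},  # both regions overlap here
        {"car_a": (0.9, 0.01), "car_b": (0.1, 0.01)},
        {"car_a": (0.5, 0.01)},                        # only car_a covers this pixel
    ]
    print(occlusion_masks(observed, predictions))
```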

To determine the location, shape, and appearance of a region within an image, we first forward predict the models to estimate, in a probabilistic sense, the appearance, shape, and location of each region. Assuming a piecewise-affine family of deformations, and utilizing a spring-like shape prior on that space of deformations, a local search is done, using a gradient descent method, to find each region within the imagery. Care is taken to use only those parts of the model estimated to be visible at the current time. Once the models have been fit to the imagery, they are then updated by incorporating the sample of shape and appearance present within the imagery.
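The fitting step can be sketched as gradient descent on an energy with a data term and a spring-like shape term. In the toy version below, the data term pulls each visible vertex toward a hypothetical observed position (a stand-in for the image mismatch under a piecewise-affine warp), and the spring term penalizes changes in the edge vectors relative to the predicted mesh. The function name, the quadratic spring form, and all parameter values are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Vec2 = Tuple[float, float]


def fit_mesh(
    predicted: Sequence[Vec2],
    targets: Sequence[Vec2],
    visible: Sequence[bool],
    edges: Sequence[Tuple[int, int]],
    stiffness: float = 1.0,
    step: float = 0.1,
    iters: int = 500,
) -> List[Vec2]:
    """Minimize data + spring energy over vertex positions by gradient descent.

    Energy = sum over visible i of |v_i - target_i|^2
           + stiffness * sum over edges (i, j) of |(v_i - v_j) - (p_i - p_j)|^2
    where p is the predicted (rest) mesh.
    """
    verts = [list(p) for p in predicted]  # start the local search at the prediction
    for _ in range(iters):
        grad = [[0.0, 0.0] for _ in verts]
        # Data term: only vertices estimated to be visible contribute.
        for i, (v, t) in enumerate(zip(verts, targets)):
            if visible[i]:
                for k in range(2):
                    grad[i][k] += 2.0 * (v[k] - t[k])
        # Spring term: penalize deviation of each edge vector from its rest value.
        for i, j in edges:
            for k in range(2):
                d = (verts[i][k] - verts[j][k]) - (predicted[i][k] - predicted[j][k])
                grad[i][k] += 2.0 * stiffness * d
                grad[j][k] -= 2.0 * stiffness * d
        for i in range(len(verts)):
            for k in range(2):
                verts[i][k] -= step * grad[i][k]
    return [(v[0], v[1]) for v in verts]


if __name__ == "__main__":
    predicted = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    targets = [(2.0, 1.0), (3.0, 1.0), (0.0, 0.0)]  # third target is ignored (occluded)
    visible = [True, True, False]
    edges = [(0, 1), (1, 2), (2, 0)]
    print(fit_mesh(predicted, targets, visible, edges))
```

In this example the occluded third vertex has no data term, so the springs alone carry it along with the two visible vertices, which mirrors the idea of continuing to deform a region's shape through an occlusion.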


Framing the tracking problem in terms of appearance and shape on regions instead of objects opens up the possibility of tracking whole crowds of people as a cohesive unit. Figures 1 and 3 show results of tracking groups of people. Tracking groups of people moving similarly gives robustness to noise that would not otherwise be possible. In Figure 1, for example, several people cut through the tracked group without throwing off the model. Also, people continuously change positions within the group. This type of small, local motion is treated not as a deformation but as a change to the textured appearance of the whole region.

Figure 2 shows a more classical tracking example. Two benefits of our model are shown here. First, we are able to easily track the two cars through the region of occlusion due to the statistical models we maintain over time. Second, we are able to maintain the shape and appearance of the occluded car, and continue to deform its shape as its pose changes, even through the area of occlusion.


Figure 1: Tracking crowds as deforming, dynamically textured regions. Top row: imagery overlaid with tracked mesh. Bottom: mean of tracked appearance model.


Figure 2: A classic tracking example. The maintenance of a joint probabilistic model of texture and shape allows us to reason about occlusion and to continue tracking through areas of occlusion. Left and right columns show the two cars' mean appearance, as estimated by their models, and their occlusion masks.


Figure 3: Tracking a group of people as they approach, and are occluded by, the escalator. Note how the model of the group is maintained even as the first person in the group leaves the scene by going past the observed section of the escalator. Similarly, the legs of the people as they approach the escalator are maintained and a model of the appearance of the group of people continues to be estimated even though they are occluded by the escalator's rails.



Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu