
Learning Video Processing

Ali Rahimi, Ben Recht & Trevor Darrell

Many vision applications can be expressed as mapping one time series to another. For example, tracking can be expressed as learning a mapping from each frame of a video sequence to an intrinsic time-varying attribute of the scene, such as the position of the limbs of a person in the scene. We seek to learn such mappings from a small set of examples, each pairing a frame with the attribute that should be extracted from it.

To improve the performance of the learned function, the algorithm uses information in both the labelled examples and all of the unlabelled frames in the sequence.

Here is an example input video [mpeg] of Ben flailing his hands. For a few frames, we specified the position of his joints. From these frames and the rest of the video, the algorithm learned a function that takes an image as input and returns the position of Ben's shoulders, elbows, and hands.

Specified Examples
Figure 1: We provided the desired output for 12 frames of a 1500-frame sequence. For these frames, we specified the location of the hand, elbow, and shoulder of the subject in the scene.
Applying the Function
Figure 2: Once learned, the function is applied to each frame of the 1500-frame sequence. The output is overlaid in white here. Black overlays are the output of a function that was learned using only the provided examples, ignoring the unlabelled data. Using unlabelled data improves the performance of the function.

Applying the Function (out of sample)
Figure 3: The function can also be applied to frames that were not in the training sequence. These three frames are from another video sequence that was not supplied as unlabelled training data. The learned function can be applied to any image that is similar to the images in the training data set.

This video [avi] compares simple regression (black markers) with semi-supervised regression (white markers). The above figures show the training data and some snapshots from this video.

This technique is closely tied to manifold learning, system identification, and semi-supervised learning. For more details, please see [1].
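The idea of learning from a few labelled frames plus all of the unlabelled frames can be illustrated with a generic semi-supervised regressor. The sketch below is not the algorithm of [1]. It assumes a swapped-in technique, Laplacian-regularized kernel least squares, in which consecutive frames are encouraged to produce similar outputs. The synthetic "frames", the function names (`rbf_kernel`, `chain_laplacian`, `fit`, `predict`), and the parameters (`bandwidth`, `ridge`, `smooth`) are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(A, B, bandwidth):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def chain_laplacian(n):
    """Graph Laplacian of a chain that links each frame to the next one."""
    L = np.zeros((n, n))
    idx = np.arange(n - 1)
    L[idx, idx] += 1.0
    L[idx + 1, idx + 1] += 1.0
    L[idx, idx + 1] -= 1.0
    L[idx + 1, idx] -= 1.0
    return L


def fit(frames, labeled_idx, labels, bandwidth=1.0, ridge=1e-3, smooth=1.0):
    """Return weights alpha for f(x) = sum_j alpha_j * k(frames[j], x).

    Minimizes: squared error on labelled frames
               + ridge  * (kernel norm of f)
               + smooth * sum_t ||f(x_{t+1}) - f(x_t)||^2 over ALL frames.
    With smooth=0 the unlabelled frames are ignored (plain kernel ridge).
    """
    n = frames.shape[0]
    K = rbf_kernel(frames, frames, bandwidth)
    J = np.zeros((n, n))
    J[labeled_idx, labeled_idx] = 1.0          # selects the labelled frames
    Y = np.zeros((n, labels.shape[1]))
    Y[labeled_idx] = labels                    # zero rows for unlabelled frames
    system = J @ K + ridge * np.eye(n) + smooth * chain_laplacian(n) @ K
    return np.linalg.solve(system, Y)


def predict(frames, alpha, new_frames, bandwidth=1.0):
    """Apply the learned function to any frames, in or out of the sequence."""
    return rbf_kernel(new_frames, frames, bandwidth) @ alpha


# Synthetic stand-in for a video: a slowly varying 2-D "pose" and
# 3-D "image features" that depend on it (hypothetical data, not real video).
rng = np.random.default_rng(0)
n_frames = 300
t = np.linspace(0.0, 4.0 * np.pi, n_frames)
pose = np.column_stack([np.sin(t), np.cos(0.5 * t)])
frames = np.column_stack([pose, pose[:, 0] * pose[:, 1]])
frames = frames + 0.01 * rng.normal(size=frames.shape)

labeled_idx = np.arange(0, n_frames, 25)       # 12 labelled frames
labels = pose[labeled_idx]

alpha_sup = fit(frames, labeled_idx, labels, smooth=0.0)   # labels only
alpha_semi = fit(frames, labeled_idx, labels, smooth=1.0)  # labels + all frames

pred_sup = predict(frames, alpha_sup, frames)
pred_semi = predict(frames, alpha_semi, frames)

rmse_sup = np.sqrt(np.mean((pred_sup - pose) ** 2))
rmse_semi = np.sqrt(np.mean((pred_semi - pose) ** 2))
print(f"RMSE, labelled frames only:        {rmse_sup:.4f}")
print(f"RMSE, with unlabelled frames too:  {rmse_semi:.4f}")
```

Setting `smooth=0.0` plays the role of the simple regression baseline, while a positive `smooth` lets the unlabelled frames shape the function through the temporal penalty. Because `predict` only needs a kernel evaluation against the training frames, the same call can be made on frames from a different sequence, in the spirit of the out-of-sample use shown in Figure 3. How much the penalty helps depends on the data and on the chosen parameters.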

References:

[1] A. Rahimi, B. Recht, T. Darrell. Learning Appearance Manifolds from Video. In The Proceedings of Computer Vision and Pattern Recognition, San Diego, CA, USA, July 2005. (pdf)

 Computer Science and Artificial Intelligence Laboratory (CSAIL) The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA tel:+1-617-253-0073 - publications@csail.mit.edu (Note: On July 1, 2003, the AI Lab and LCS merged to form CSAIL.)