MIT CSAIL Research Abstracts

Inferring Dynamic Dependency Structure for Object Interaction Analysis

Michael R. Siracusa & John W. Fisher III

Problem

We are interested in analyzing scenes in which multiple objects such as vehicles or pedestrians are moving around in an environment. Given noisy measurements of their position for a long range of time, a question one may ask is, which, if any, of these objects are interacting at each point in time. The answer to this type of question is useful for social interaction analysis, anomalous event detection and automatic scene summarization. For example, Heider and Simmel [1] showed that when describing a cartoon of simple shapes moving on a screen, one will tend to described their behavior with terms such as "chasing", "following" or being independent of one another. These semantic labels describe the dependency between objects and allow humans to provide a compact description of what they are observing.

Approach

We view this problem as a specific example of a general class of problems which we refer to as dynamic dependency tests. A dynamic dependency test answers the following question: given multiple data streams, how does their interaction evolve over time? Here, interaction is defined in terms of changing graphical structures, i.e., the presence or absence of edges in a graphical model. For example, in the object interaction case, "X is following Y" implies a casual dependency between Y and X.

We cast this problem as one of inference on a special class of probabilistic models in which a latent state variable indexes a discrete set of possible dependency structures on measurements. We refer to this class of models as dynamic dependence models and introduce a specific implementation via a hidden factorization Markov model (HFactMM). This model allows us to take advantage of both structural and parametric changes associated with changes in the state of interaction of a set of objects.

HFactHMMs are also closely related to switching linear dynamic systems (SLDS) models used by the tracking community[2]. SLDS models are combinations of discrete Markov models and linear state-space dynamical systems. The hidden discrete state chooses between a predefined number of state-space models to describe the data at each point in time. SLDSs are primarily used to help improve tracking and track interpretation by allowing changes to the state-space model parameters. Exact inference and learning for such models is difficult. However, we are only interested in changes to the dependency structure of the observations and consider the latent dynamic state in the state-space model a nuisance variable. An HFactMM explicitly models varying dependence structure over time and allows for efficient learning and inference. We show how inference on an HFactMM can be used as a subroutine in an approximate inference algorithm for SLDS models.

Figure 1: A model for two objects switching between interacting and moving independently. Following the notation of Contingent Bayesian Networks [3], conditionally labeled edges are only present when the condition is true.

Figure 1 shows a model of two objects switching between moving independently and interacting. S is the hidden interaction state we are interested in, X1 and X2 are the dynamic state for each object and O1 and O2 are the observations we recieve (e.g. noisy estimates of position). Note that the conditionally labeled (blue) edges are only present when the interaction state is 1. Given a sequence of observations our task is to estimate the hidden interaction state at each point in time.

One approach is to perform coordinate ascent. Conditioned on knowing the interaction state S we performing learning and inference on the dynamic state X. That is, giving S perform EM to learn the state-space model parameters describing the dynamics on X and the observation noise process. This is followed by Kalman smoothing to get an estimate of X. Next, conditioned on X the model takes the form of a first order HFactMM which allows one to perform learning and inference on the interaction state S using EM and Viterbi decoding. These two steps are repeated until convergence. See [4] for more details.

Progress

In [4] we have evaluated the simple approach described above on both controlled synthetic and recorded data of people/objects interacting. Figure 2 show a sample video frame from one of the data sequences in which two individuals move around an enclosed environment and are randomly instructed to "interact" for a limited period of time. The sequences were recorded using a camcorder at 12 fps. Background subtraction and color filtering with smoothing provided a simple blob tracker that was further guided/tracked by hand to provide noisy estimates of position for each individual for each frame. Results for two simple sequences are shown in the following videos:

Sequence 1 [.avi , H.264 .mov] - Sequence 2 [.avi , H.264 .mov].
The border of the video turns red when our method infers interaction.

Figure 2: An annotated frame from a scene in which two people switch between interacting and moving independently.

Future work will focus on exploring issues associated with model mismatch and the combinatorics associated with having more than 2 objects and unknown possible dependency structures. This is in addition to our general focus of developing simple, tractable and efficient algorithms and models for the general class of dynamic dependency test.

References:

[1] F. Heider and M. Simmel, An experimental study of apparent behavior. In American Journal of Psychology, 1944, vol 57 pp 243-256.

[2] Z. Ghahramani and G.E. Hinton, Variational learning for switching state-space models. In Neural Computation, 1998, vol. 12, pp. 963-996.

[3] B. Milch, B. Marthi, D. Sontag, S. Russell, D.L. Ong, and A. Kolobov, Approximate inference for infinite contingent bayesian networks. In AIStats, 2005.

[4] M.R. Siracusa and J.W. Fisher III, Inferring Dynamic Dependency with Applications to Link Analysis. In Asilomar Conference on Signals, Systems, and Computers 2006.