Research Abstracts - 2007

Goal Inference as Inverse Planning

Chris L. Baker, Rebecca R. Saxe & Joshua B. Tenenbaum

Introduction

Infants and adults are adept at inferring agents' goals from observations of behavior. Often these observations are ambiguous or incomplete, yet we confidently make goal inferences from such data many times each day. The apparent ease of goal inference masks a sophisticated probabilistic induction. There are typically many goals logically consistent with an agent's actions in a particular context, and the apparent complexity of agents' actions invokes a confusing array of explanations, yet observers' inductive leaps to likely goals occur effortlessly and accurately.

Unlike humans, computers have great difficulty with goal inference. By building computational models of human goal inference, we hope to close the gap between human and computer performance. Here, we propose a computational framework for goal inference in terms of inverse probabilistic planning. It is often said that "vision is inverse graphics": computational models of visual perception -- particularly in the Bayesian tradition -- often posit a causal physical process by which images are formed from scenes (i.e. "graphics"), and perceiving scene structure from images requires inverting this process. By analogy, in inverse planning, planning is the causal process by which intentions produce behavior, and the observer infers an agent's intentions from observations of its behavior by inverting a model of the agent's planning process.

To evaluate specific models within our framework, we compare their predictions with the judgments of human subjects on artificial stimuli designed to probe a broad range of goal inferences. Here, we present the specific model that best explains people's judgments across several experiments we have conducted, and we describe one of these experiments to illustrate our basic experimental methodology.

Inverse planning framework

At its core, the inverse planning framework assumes that human observers represent other agents as rational planners solving Markov decision processes (MDPs). The causal process by which goals produce behavior is modeled as probabilistic planning in MDPs with goal-dependent reward functions. Using Bayesian inference, this causal model can be combined with prior knowledge of likely goal structures to yield a probability distribution over agents' goals given their behavior.

Let $ S$ be the set of agent states, $ W$ the set of environmental states, $ G$ the set of goals, and $ A$ the set of actions. Let $ s_t\in S$ be the agent's state at time $ t$, $ w\in W$ the world state (assumed constant across trials), $ g \in G$ the agent's goal, and $ a_t \in A$ the agent's action at time $ t$. Let $ P(s_{t{+}1}\vert s_t,a_t,w)$ be the state transition distribution, which specifies the probability of moving to state $ s_{t{+}1}$ from state $ s_t$, as a result of action $ a_t$, in world $ w$. In general, the dynamics of state transitions depend on the environment, but for the stimuli considered in this paper, state transitions are assumed to yield the desired outcome deterministically.

Let $ C_{g,w}(a,s)$ be the cost of taking action $ a$ in state $ s$ for an agent with goal $ g$ in world $ w$. In general, cost functions may differ between agents and environments. For our 2D motion scenarios, action costs are assumed to be proportional to the length of the resulting movement (staying still incurs a cost as well). The goal state is absorbing and cost-free, meaning that the agent incurs no cost once it reaches the goal and stays there. Thus, rational agents will try to reach the goal state as quickly as possible.
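To make these definitions concrete, here is a minimal sketch of how the components above might be encoded for a 2D gridworld like our stimuli. It is illustrative only, not the experimental code: the action set, the 'free_cells' representation of the world, and the cost of staying still are our assumptions, and costs are encoded as negative numbers so that the Boltzmann policy defined below favors cheaper actions.

    import numpy as np

    # Hypothetical encoding of the MDP components for a 2D grid.
    # 8-connected moves plus "stay"; movement cost is proportional to step length.
    ACTIONS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]

    def transition(s, a, w):
        """Deterministic transition P(s_{t+1} | s_t, a_t, w): the intended move
        succeeds unless blocked by an obstacle, in which case the agent stays put."""
        nxt = (s[0] + a[0], s[1] + a[1])
        return nxt if nxt in w['free_cells'] else s

    def cost(s, a, g, w):
        """C_{g,w}(a, s): proportional to movement length; the goal state is
        absorbing and cost-free. Costs are negative numbers (i.e. rewards),
        and staying still incurs a (smaller, assumed) cost of 0.5."""
        if s == g:
            return 0.0
        return -max(np.hypot(a[0], a[1]), 0.5)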

The value function $ V^{\pi}_{g,w}(s)$ is defined as the infinite-horizon expected cost to the agent of executing policy $ \pi$ starting from state $ s$ (with no discounting):

$\displaystyle V^{\pi}_{g,w}(s) = E_{\pi}\Biggl[\,\sum_{t=1}^{\infty} \sum_{a_t} P_{\pi}(a_t\vert s_t,g,w)\, C_{g,w}(a_t,s_t) \;\Bigg\vert\; s_1 = s \Biggr].$ (1)

$ Q^{\pi}_{g,w}(s_t,a_t) = \sum_{s_{t+1}} P(s_{t+1}\vert s_t,a_t,w)\, V^{\pi}_{g,w}(s_{t+1}) + C_{g,w}(a_t,s_t)$ is the state-action value function, which gives the infinite-horizon expected cost of taking action $ a_t$ from state $ s_t$, with goal $ g$, in world $ w$, and executing policy $ \pi$ afterwards. The agent's probability distribution over actions under policy $ \pi$ is defined as $ P_{\pi}(a_t\vert s_t,g,w) \propto \exp\bigl(\beta Q^{\pi}_{g,w}(s_t,a_t)\bigr)$, sometimes called a Boltzmann policy. This policy embodies a "soft" principle of rationality: the parameter $ \beta $ controls how likely the agent is to deviate from the rational path for unexplained reasons.
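Under the deterministic transitions assumed above, the sum over $ s_{t+1}$ in the state-action value function collapses to a single term, and $ V$ and the policy can be computed by a simple fixed-point iteration. The sketch below shows one such scheme under the conventions of the previous sketch; the function names, the choice $ \beta = 2$, and the convergence test are our assumptions, not the paper's.

    import numpy as np

    def soft_value_iteration(states, actions, transition, cost, g, w,
                             beta=2.0, iters=1000, tol=1e-6):
        """Fixed-point iteration for V and Q under the Boltzmann policy:
        Q(s,a) = C(s,a) + V(s'), pi(a|s) proportional to exp(beta*Q(s,a)),
        and V(s) = E_pi[Q(s,a)]."""
        V = {s: 0.0 for s in states}
        for _ in range(iters):
            V_new, delta = {}, 0.0
            for s in states:
                if s == g:                        # absorbing, cost-free goal state
                    V_new[s] = 0.0
                    continue
                q = np.array([cost(s, a, g, w) + V[transition(s, a, w)]
                              for a in actions])
                p = np.exp(beta * (q - q.max()))  # numerically stable softmax
                p /= p.sum()
                V_new[s] = float(p @ q)           # expected Q under the soft policy
                delta = max(delta, abs(V_new[s] - V[s]))
            V = V_new
            if delta < tol:
                break
        return V

    def boltzmann_policy(s, V, actions, transition, cost, g, w, beta=2.0):
        """P_pi(a_t | s_t, g, w) proportional to exp(beta * Q(s_t, a_t)),
        evaluated with the converged V."""
        q = np.array([cost(s, a, g, w) + V[transition(s, a, w)] for a in actions])
        p = np.exp(beta * (q - q.max()))
        return p / p.sum()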

The general inverse planning framework we have described includes many specific models, which differ in the complexity they assign to the beliefs and desires of agents. Fig. 1 illustrates the specific model we consider here in graphical model notation. This model assumes that agents' goals can change over time according to a Markov process. Although the graphical model representation captures the relationships between variables in the model, inverse planning is required to construct the conditional probability tables (CPTs) that determine the agent's policy.

Figure 1: Graphical model. This represents the statistical dependencies among variables in the model, and determines the computations necessary for goal inference.

Model predictions are generated by computing the posterior probability of goals, given observations of behavior. To compute the posterior distribution over goals at time $ t$, given a state sequence $ s_{1:t+1}$, we recursively define the forward distribution:

$\displaystyle P(g_t\vert s_{1:t+1}) \propto P(s_{t+1}\vert g_t,s_t) \sum_{g_{t-1}} P(g_t\vert g_{t-1}) P(g_{t-1}\vert s_{1:t}),$ (2)

which follows from the graphical model in Fig. 1.
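A minimal sketch of this recursion follows. It assumes each goal's Boltzmann policy has already been compiled (e.g. by the value-iteration sketch above) into a next-state likelihood function; the function names and the matrix encoding of $ P(g_t\vert g_{t-1})$ are our choices for illustration.

    import numpy as np

    def goal_posterior(state_seq, goals, next_state_lik, goal_trans, prior):
        """Forward filter implementing Eq. 2. next_state_lik(g, s, s_next)
        returns P(s_{t+1} | g_t, s_t), the probability that the policy for
        goal g produces the observed move; goal_trans[i][j] is
        P(g_t = goals[j] | g_{t-1} = goals[i])."""
        belief = np.array([prior[g] for g in goals], dtype=float)
        goal_trans = np.asarray(goal_trans, dtype=float)
        for t in range(len(state_seq) - 1):
            s, s_next = state_seq[t], state_seq[t + 1]
            predicted = goal_trans.T @ belief  # sum over g_{t-1} of P(g_t|g_{t-1}) P(g_{t-1}|s_{1:t})
            lik = np.array([next_state_lik(g, s, s_next) for g in goals])
            belief = lik * predicted           # multiply by P(s_{t+1}|g_t,s_t)
            belief /= belief.sum()             # normalize (the proportionality in Eq. 2)
        return dict(zip(goals, belief))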

Experiment and results

Participants were 16 members of the MIT community. They were told they would watch 2D videos of intelligent aliens moving around in simple environments with visible obstacles, with goals marked by capital letters.

There were 100 stimuli in total; an illustrative subset is shown in Fig. 2(a). Each stimulus contained 3 goals. There were 4 different goal configurations and two obstacle conditions (gap and solid), for a total of 8 different environments. There were 11 different complete paths: 2 headed toward 'A', 2 toward 'B', and 7 toward 'C' (to account for C's varying locations). Partial segments of these paths, starting from the beginning, were shown in each environment. Because many of the paths were initially identical, and because some paths were impossible in certain environments (i.e., they collided with walls), the total number of unique stimuli was reduced to 100. Stimuli were presented in order of increasing length, so as not to bias subjects toward particular outcomes, and stimuli of the same length were shown in random order.

After each stimulus presentation, subjects were asked to choose which goal they thought was most likely (or, if two or more were equally likely, to pick one of the most likely). After this choice, subjects rated the likelihood of the other goals relative to the most likely goal on a 9-point scale ranging from "Equally likely" through "Half as likely" to "Extremely unlikely". Ratings were normalized to sum to 1 for each stimulus, then averaged across all subjects and renormalized to sum to 1. Example subject ratings are plotted with standard error bars in Fig. 2(b).
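The normalization just described amounts to a few array operations; the following sketch assumes the raw responses are stored in a hypothetical (subjects x stimuli x goals) array.

    import numpy as np

    def normalize_ratings(ratings):
        """Normalize per stimulus, average across subjects, then renormalize."""
        per_stim = ratings / ratings.sum(axis=2, keepdims=True)  # sum to 1 per stimulus
        avg = per_stim.mean(axis=0)                              # average across subjects
        return avg / avg.sum(axis=1, keepdims=True)              # renormalize to sum to 1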

Our model makes strong predictions about people's ratings in this experiment. Model predictions are computed using Eq. 2; examples are plotted in Fig. 2(c). The predictions match subjects' ratings very closely across all 100 stimuli, with a correlation coefficient of r = .96.

Figure 2: (a) Example stimuli. Plots show all 4 goal conditions and both obstacle conditions. Both 'A' paths are shown, one of the two 'B' paths is shown, and 2 of the 7 'C' paths are shown. Dark-colored numbers indicate the displayed path lengths. (b) Average subject ratings, with standard error bars, for the above stimuli. (c) Model predictions. These closely match people's ratings, correlating highly with subjects' ratings (r = .96).

Discussion

This abstract is based on work from [1] and [2]. These papers describe further experiments and provide comparisons with several alternative models. Overall, our results support the specific model described in this paper, and the inverse planning framework more generally.

Ongoing work is aimed at extending our results to more realistic and complex environments, and exploring the capacity of our framework to predict people's inferences about other mental states, such as belief. For further details on this project, see Chris Baker's research webpage.

Research support

CLB was supported by a Department of Homeland Security Graduate Fellowship. Further support was provided under AFOSR contract FA9550-05-1-0321.

References

[1] Chris L. Baker, Joshua B. Tenenbaum & Rebecca R. Saxe. Bayesian models of human action understanding. In Advances in Neural Information Processing Systems 18, 2006.

[2] Chris L. Baker, Joshua B. Tenenbaum & Rebecca R. Saxe. Goal inference as inverse planning. In Proceedings of the Twenty-Ninth Annual Conference of the Cognitive Science Society, submitted.

 
