Abstracts - 2006
A Stored-Processor Abstract Machine Supporting a Recursive Cellular Virtual Machine
Heidi Pan, Rose Liu & Krste Asanović
Technology scaling will soon enable hundreds of processor cores to be placed on a die, and thousands of cores to be incorporated in low-cost systems, dramatically extending the range of applications in which massively parallel architectures will be used. We believe conventional parallel programming models, operating systems, and architectures cannot be independently and incrementally improved to provide the levels of performance scaling, programmability, self-management, dependability, or real-time responsiveness that will be required to support future applications, such as consumer robotics and autonomous vehicles. Thus, we propose a new architectural paradigm, the stored-processor computer, along with a recursive cellular virtual machine model, that aim to satisfy these needs.
The Stored-Processor Abstract Machine
The stored-processor abstract machine puts all system state in a global memory space, including the state of all hardware threads, translation, protection, scheduling, and inter-processor synchronization. The state is represented by data structures that are easy to manipulate in software, yet can be easily parsed by hardware engines. Analogous to how stored-program computers increased flexibility and usability of early computers by allowing instructions to be treated as data, the stored-processor model increases the capabilities of massively parallel architectures by allowing parallel program state to be manipulated as data.
Our abstract machine is highly concurrent. Hardware threads are virtualized as stored processors that are decoupled from the physical processing engines. A stored processor is simply defined by a fixed-size data structure residing in globally accessible memory, including general-purpose user registers and privileged state. Software can create as many hardware thread contexts as needed, limited only by the size of memory rather than by the number and size of the engines. Hardware executes instructions from these memory-resident threads directly, so software does not have to manage moving threads onto the engines. In practice, the threads are cached in physical registers while running, so the stored-processor architecture has no inherent performance penalty over conventional architectures.
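The idea of a memory-resident thread context can be sketched as follows. This is a hypothetical illustration, not the paper's actual data layout: the names (StoredProcessor, NUM_GPRS, create_thread) and field choices are our own assumptions about what such a fixed-size record might contain.

```python
from dataclasses import dataclass, field

NUM_GPRS = 32  # illustrative register-file size

@dataclass
class StoredProcessor:
    """A fixed-size, memory-resident hardware thread context (sketch)."""
    pc: int = 0
    gprs: list = field(default_factory=lambda: [0] * NUM_GPRS)
    privileged: dict = field(default_factory=dict)  # translation/protection state
    runnable: bool = True

# The global memory space holds arbitrarily many contexts; the number of
# threads is limited by memory size, not by the number of physical engines.
memory = {}  # address -> StoredProcessor

def create_thread(addr, entry_pc):
    memory[addr] = StoredProcessor(pc=entry_pc)
    return addr

tid = create_thread(0x1000, entry_pc=0x400)
assert memory[tid].pc == 0x400
assert len(memory[tid].gprs) == NUM_GPRS
```

Because the context is plain data, a runtime can create, inspect, or migrate threads with ordinary loads and stores, while a hardware implementation caches the running contexts in physical registers.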
We provide a parallel execution model with simple semantics, where computation proceeds as a series of globally atomic steps. Each step is either an instruction executed by a stored processor or an I/O event. Steps are designed to touch at most a small bounded number of memory locations, allowing implementations to use simple hardware micro-transactions on memory state. The abstract machine includes fine-grained hardware synchronization primitives, such as futures and unbounded transactions [1], which are composed of a finite number of steps, each of which may cause other stored processors to be scheduled or descheduled.
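The step semantics can be illustrated with a small sketch, under our own assumptions: each step reads a bounded set of locations, computes its writes without side effects, and then commits them all at once, so the step appears as a single micro-transaction. The names (atomic_step, MAX_LOCATIONS) and the bound of four locations are illustrative, not taken from the paper.

```python
MAX_LOCATIONS = 4  # illustrative bound on locations one step may touch

mem = {0: 5, 1: 7, 2: 0}

def atomic_step(reads, compute):
    """Execute one globally atomic step as a micro-transaction (sketch)."""
    vals = {a: mem[a] for a in reads}      # bounded read set
    writes = compute(vals)                 # pure: no side effects yet
    if len(set(reads) | set(writes)) > MAX_LOCATIONS:
        raise ValueError("step exceeds micro-transaction bound")
    mem.update(writes)                     # commit all writes in one shot
    return writes

# One instruction step: mem[2] = mem[0] + mem[1]
atomic_step([0, 1], lambda v: {2: v[0] + v[1]})
assert mem[2] == 12
```

Bounding the footprint of each step is what makes a simple hardware micro-transaction sufficient: the implementation never needs to track an unbounded read/write set for a single step.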
In contrast to conventional machines, which rely on one or two privileged processor modes and a monolithic operating system interface for protected access to shared resources, our abstract machine maps all resources into the memory space and uses the Mondrian fine-grained memory protection scheme (MMP) [2] to grant and revoke access permissions for an arbitrary number of user-defined privilege levels.
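The flavor of fine-grained grant/revoke can be sketched as below. This is a deliberately simplified stand-in for MMP: a real MMP implementation uses a compressed, word-granularity permission-table structure in hardware, whereas the dictionary layout and names here (PermTable, grant, revoke, check) are purely illustrative.

```python
# Permission values per (protection domain, address) pair (sketch).
NONE, READ, READ_WRITE = 0, 1, 2

class PermTable:
    """Illustrative fine-grained permission table in the spirit of MMP."""
    def __init__(self):
        self.perms = {}  # (domain_id, address) -> permission

    def grant(self, domain, addr, perm):
        self.perms[(domain, addr)] = perm

    def revoke(self, domain, addr):
        self.perms.pop((domain, addr), None)

    def check(self, domain, addr, write=False):
        p = self.perms.get((domain, addr), NONE)
        return p == READ_WRITE if write else p >= READ

table = PermTable()
table.grant(domain=3, addr=0x2000, perm=READ)
assert table.check(3, 0x2000)                       # read allowed
assert not table.check(3, 0x2000, write=True)       # write denied
table.revoke(3, 0x2000)
assert not table.check(3, 0x2000)                   # access revoked
```

Because every resource is mapped into the memory space, granting or revoking a capability reduces to editing this table for the relevant domain, with no fixed ceiling on the number of privilege levels.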
The Recursive Cellular Virtual Machine
As shown in Figure 1, we structure the entire system as an arbitrarily nested collection of cellular virtual machines, or cells. A cell is a user-defined unit of computation bundled with its own dynamic partition of the hardware resources (e.g. processing engines, global address space, physical memory, memory bandwidth, energy). Each cell is a pure virtual machine, operating on a complete stored-processor abstract machine model to directly control its hardware resources. However, each parent cell subsumes its child cells, or equivalently, every child cell's computation state and resources are a strict subset of its parent's. This frees a parent cell from having to micromanage its children, while maintaining the right to inspect a child's state or reclaim a child's resources.
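The subset invariant between parent and child cells can be sketched as follows, using engines as the example resource. The class and method names (Cell, reclaim) and the free-pool bookkeeping are our own illustrative assumptions.

```python
class Cell:
    """Illustrative cell: its resources are a strict subset of its parent's."""
    def __init__(self, name, engines, parent=None):
        self.name, self.parent, self.children = name, parent, []
        self.engines = set(engines)    # resources this cell owns
        self.free = set(self.engines)  # not yet delegated to a child
        if parent is not None:
            # A child may only draw from resources its parent owns and
            # has not already delegated, and never the whole set.
            assert self.engines < parent.engines and self.engines <= parent.free
            parent.free -= self.engines
            parent.children.append(self)

    def reclaim(self, child):
        """A parent may reclaim a child's resources at any time."""
        self.free |= child.engines
        self.children.remove(child)

root = Cell("root", range(8))
app = Cell("app", {0, 1, 2}, parent=root)
assert app.engines < root.engines      # strict-subset invariant holds
root.reclaim(app)
assert root.free == set(range(8))      # parent recovers the engines
```

Note that delegation never removes resources from the parent's ownership set, only from its free pool; this is what lets a parent inspect or reclaim a child's resources without the child's cooperation.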
Under the stored-processor abstract machine model, the system can continually self-organize to adapt to time-varying workload behavior. Since no computation state is tied to a specific engine, a parent cell can dynamically reallocate engines and other hardware resources among its children (and itself) without disrupting computation. Furthermore, parent cells can use MMP to enforce their resource allocation policies by ensuring that no cell can corrupt computation state or use hardware resources from outside its cell boundaries.
We provide a producer-consumer communication model between cells via portals. Each portal passes discrete, immutable versions of data from a single producer cell to all of its consumer cells. Portals allow producer and consumer cells to be decoupled. Cells can publish and consume at different rates, and can dynamically subscribe and unsubscribe to the portal without disrupting each other.
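The portal semantics described above can be sketched as follows; the class and method names (Portal, publish, subscribe, consume) and the per-consumer cursor scheme are illustrative assumptions, not the paper's interface.

```python
class Portal:
    """Illustrative portal: one producer, many decoupled consumers."""
    def __init__(self):
        self.versions = []   # discrete, immutable published versions
        self.cursors = {}    # consumer id -> index of next unread version

    def publish(self, data):
        self.versions.append(tuple(data))            # freeze the version

    def subscribe(self, consumer):
        self.cursors[consumer] = len(self.versions)  # start at the newest

    def unsubscribe(self, consumer):
        self.cursors.pop(consumer, None)

    def consume(self, consumer):
        i = self.cursors[consumer]
        if i >= len(self.versions):
            return None                              # nothing new yet
        self.cursors[consumer] = i + 1
        return self.versions[i]

p = Portal()
p.subscribe("planner")
p.publish([1, 2])
p.publish([3, 4])
assert p.consume("planner") == (1, 2)   # consumers read at their own rate
p.subscribe("logger")                   # late subscriber: only future data
assert p.consume("logger") is None
assert p.consume("planner") == (3, 4)
```

Because versions are immutable and each consumer keeps its own cursor, producers and consumers never block one another, and subscription changes do not disturb cells already attached to the portal.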
We have built an execution-driven simulator for the stored-processor abstract machine based on the Pin dynamic binary instrumentation tool [3], using the techniques described in [4]. Our simulation infrastructure can simulate numerous stored processors with various interleavings running on multiple engines. We model a cellular virtual machine running on the stored-processor architecture by injecting our architecture simulation code into the cellular code with Pin. The simulator provides the scheduling, synchronization, and protection interface to the cellular software.
In addition, we have designed a flexible and modular framework for constructing cellular virtual machines. We define a bare-bones interface for how cells and various components within a cell can interoperate. We provide basic libraries for a cell that conforms to this interface, which implement stored processor scheduling, engine allocation, resource protection, and subcell performance monitoring. These libraries can be easily used to create basic cells that run existing, unmodified, multithreaded, shared-memory applications on a stored-processor architecture. Our interface is extensible, and enables users to easily replace one or more of these standard libraries with application-specific ones for any cell in the system.
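The kind of component replacement this framework enables can be sketched as below. The component and class names (Scheduler, RoundRobinScheduler, BasicCell) are hypothetical; they stand in for the standard libraries (scheduling, engine allocation, protection, monitoring) that a user may swap out per cell.

```python
class Scheduler:
    """Default scheduling library: pick the first runnable thread (sketch)."""
    def pick(self, runnable):
        return runnable[0] if runnable else None

class RoundRobinScheduler(Scheduler):
    """An application-specific replacement conforming to the same interface."""
    def __init__(self):
        self.next = 0
    def pick(self, runnable):
        if not runnable:
            return None
        choice = runnable[self.next % len(runnable)]
        self.next += 1
        return choice

class BasicCell:
    """A cell assembled from default libraries; any component may be swapped."""
    def __init__(self, scheduler=None):
        self.scheduler = scheduler or Scheduler()
        self.runnable = []

# Replace only the scheduling library for this one cell; the rest stays stock.
cell = BasicCell(scheduler=RoundRobinScheduler())
cell.runnable = ["t0", "t1", "t2"]
picks = [cell.scheduler.pick(cell.runnable) for _ in range(4)]
assert picks == ["t0", "t1", "t2", "t0"]
```

The key property is that replacements are per-cell and per-component: a cell running an unmodified application keeps the stock libraries, while a specialized cell overrides only the policies it cares about.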
The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency (DARPA) or the U.S. Government.
This project is sponsored by an NSF Graduate Fellowship and the Defense Advanced Research Projects Agency (DARPA) through the Department of the Interior National Business Center under grant number NBCH104009.
[1] C. Scott Ananian, Krste Asanović, Bradley C. Kuszmaul, Charles E. Leiserson, and Sean Lie. Unbounded Transactional Memory. 11th International Symposium on High Performance Computer Architecture (HPCA-11), San Francisco, CA, February 2005.
[2] Emmett Witchel, Josh Cates, and Krste Asanović. Mondrian Memory Protection. 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-X), San Jose, CA, October 2002.
[3] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. Programming Language Design and Implementation (PLDI), Chicago, IL, June 2005.
[4] Heidi Pan, Krste Asanović, Robert Cohn, and Chi-Keung Luk. Controlling Program Execution through Binary Instrumentation. Workshop on Binary Instrumentation and Applications (WBIA), at 14th International Conference on Parallel Architectures and Compilation Techniques (PACT-14), St. Louis, MO, September 2005.