Abstracts - 2007
HAsim: Implementing a Partitioned Performance Model on an FPGA
Michael Pellauer, Joel Emer & Arvind
The Goal of Performance Models: System architects value performance simulators as a way to explore architectural ideas before implementation. To be effective, such simulators require three basic properties: speed of design, confidence in their correctness, and speed of simulation. Previous software implementations of such simulators invariably trade off among these goals. For accuracy, software simulators must model a great deal of state, which is inherently slow. Even when fidelity is reduced, detailed simulations still run 3 to 4 orders of magnitude slower than the designs being modeled.
Figure 1. A Partitioned Simulator
Partitioned Models: One accepted approach, adopted by software performance models such as Asim, is to split the model into two partitions: a functional partition and a timing partition. The functional partition is responsible for executing the functional aspects of the model's instructions (for example, performing an IEEE floating-point multiplication). The timing partition, on the other hand, is solely responsible for determining the timing properties (for example, that a floating-point multiply takes twelve clock cycles). This partitioning allows designers to concentrate design and verification effort on the functional partition and to reuse that code across multiple timing models. Additionally, it greatly simplifies the code in the timing partition, which now only has to worry about timing and resource management.
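The split described above can be illustrated with a minimal Python sketch. All names here (`Instr`, `FunctionalPartition`, `TimingPartition`, `simulate`) and the opcodes are hypothetical, chosen for illustration; they are not taken from Asim or HAsim. The sketch also assumes a trivially simple, unpipelined machine in which total time is the sum of per-instruction latencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    op: str   # hypothetical opcodes: "add" or "fmul"
    a: float
    b: float


class FunctionalPartition:
    """Computes what an instruction does; knows nothing about time."""

    def execute(self, ins):
        if ins.op == "add":
            return ins.a + ins.b
        if ins.op == "fmul":
            return ins.a * ins.b
        raise ValueError(f"unknown op {ins.op}")


class TimingPartition:
    """Decides how long an instruction takes; never computes values."""

    def __init__(self, latencies):
        self.latencies = latencies

    def cycles(self, ins):
        return self.latencies[ins.op]


def simulate(program, functional, timing):
    # Assumption: unpipelined, in-order machine, so latencies simply add up.
    results, total_cycles = [], 0
    for ins in program:
        results.append(functional.execute(ins))
        total_cycles += timing.cycles(ins)
    return results, total_cycles


program = [Instr("add", 1.0, 2.0), Instr("fmul", 3.0, 4.0)]
functional = FunctionalPartition()

# The same functional partition is reused under two timing models.
slow = TimingPartition({"add": 1, "fmul": 12})
fast = TimingPartition({"add": 1, "fmul": 4})
slow_results, slow_cycles = simulate(program, functional, slow)
fast_results, fast_cycles = simulate(program, functional, fast)
```

Both runs compute identical results; only the cycle counts differ, which is the reuse property the partitioning is meant to provide.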
The HAsim Approach: The goal of the HAsim project is to investigate doing performance simulation in hardware, specifically on FPGAs. By putting the model onto an FPGA we expect to operate in the many-MIPS range: a 100x improvement over accurate software simulators. We will retain the notion of splitting the design into a timing partition and a functional partition. As in the software approach, the design effort of the functional partition can be amortized across many timing models. In addition, since the timing partition does not have to be concerned with the actual implementation of the instruction set semantics, it is easier to design and more likely to meet the physical constraints of an FPGA. To further aid design efficiency, HAsim is implemented in Bluespec SystemVerilog [3]. This allows us to preserve many Asim abstractions while now describing hardware rather than software.
Functional Partition: Existing projects have investigated placing just the timing partition onto the FPGA [4]. In HAsim the functional partition is also placed on the FPGA. HAsim breaks the functional partition into a simple pipeline with the following stages: Fetch, Decode, Execute, Memory, Local Commit, and Global Commit. Instructions pass from stage to stage as directed by the timing partition via request/response method pairs, shown as dashed arrows in Figure 2.
Figure 2. The HAsim Functional Partition
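As a rough software analogy for the request/response interface, the sketch below lets a caller (standing in for the timing partition) advance each instruction through the six stages one request at a time. The class `FunctionalPipeline`, the token scheme, and the `request` method are assumptions made for illustration; the actual HAsim interface is written in Bluespec and is not reproduced here.

```python
STAGES = ["fetch", "decode", "execute", "memory",
          "local_commit", "global_commit"]


class FunctionalPipeline:
    """Hypothetical model: tracks which stage each in-flight token needs next."""

    def __init__(self):
        self.progress = {}    # token -> index into STAGES of the next stage
        self.next_token = 0

    def request(self, stage, token=None):
        """Request one stage of work; the return value plays the response."""
        if stage == "fetch":
            # Fetch creates a new in-flight instruction and names it.
            token = self.next_token
            self.next_token += 1
            self.progress[token] = 0
        expected = STAGES[self.progress[token]]
        if stage != expected:
            raise RuntimeError(
                f"token {token}: expected {expected}, got {stage}")
        self.progress[token] += 1
        if self.progress[token] == len(STAGES):
            del self.progress[token]   # globally committed, no longer in flight
        return token, stage


pipe = FunctionalPipeline()
t0, _ = pipe.request("fetch")
t1, _ = pipe.request("fetch")
pipe.request("decode", t1)   # the caller chooses which token advances next
pipe.request("decode", t0)
for stage in STAGES[2:]:
    pipe.request(stage, t0)  # drive token 0 all the way to global commit
in_flight = sorted(pipe.progress)
```

After this sequence only token 1 remains in flight, waiting for its Execute request. Each instruction must visit the stages in order, but the caller decides when, and in what interleaving, each request is issued.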
Speculation Support: The timing partition is able to reorder instruction operations to simulate out-of-order issue microprocessors. To handle instruction reordering in the functional partition, each stage holds a table from which unprocessed instructions can be executed in arbitrary order. To support this, the functional partition implements register/memory renaming. On a misspeculation the timing partition broadcasts a kill, which is used to roll back wrong-path instructions.
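The interaction between renaming and the kill broadcast can be sketched as follows. This is an illustrative model, not the HAsim design: the class `RenamingRegisterFile` and its methods are hypothetical, tokens are assumed to be assigned in program order, renaming is assumed to happen in program order while value writes may arrive in any order, and physical registers are never reclaimed.

```python
class RenamingRegisterFile:
    """Hypothetical register file with renaming and an undo log."""

    def __init__(self, num_arch_regs):
        self.values = [0] * num_arch_regs         # physical register values
        self.rename = list(range(num_arch_regs))  # arch reg -> physical reg
        self.log = []  # (token, arch_reg, previous_phys), in program order

    def read(self, arch_reg):
        return self.values[self.rename[arch_reg]]

    def rename_dest(self, token, arch_reg):
        """Allocate a fresh physical register for this token's destination."""
        self.log.append((token, arch_reg, self.rename[arch_reg]))
        self.values.append(0)
        self.rename[arch_reg] = len(self.values) - 1
        return self.rename[arch_reg]

    def write(self, phys_reg, value):
        """Fill in a result; may be called in any order after renaming."""
        self.values[phys_reg] = value

    def kill(self, token):
        """Undo the renames of every instruction younger than `token`."""
        while self.log and self.log[-1][0] > token:
            _, arch_reg, previous = self.log.pop()
            self.rename[arch_reg] = previous


rf = RenamingRegisterFile(4)
p0 = rf.rename_dest(0, 1)   # token 0 writes r1
p1 = rf.rename_dest(1, 1)   # token 1 (wrong path) also writes r1
p2 = rf.rename_dest(2, 2)   # token 2 (wrong path) writes r2
rf.write(p2, 99)            # results arrive out of order
rf.write(p0, 7)
rf.write(p1, 8)
before = (rf.read(1), rf.read(2))
rf.kill(0)                  # everything younger than token 0 is wrong-path
after = (rf.read(1), rf.read(2))
```

Before the kill, reads observe the wrong-path values `(8, 99)`; afterwards they observe `(7, 0)`, the state as of token 0. Because each speculative write lands in its own physical register, rollback only has to restore the mapping and never has to restore values.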
Timing Partition: To facilitate efficient simulation, we have developed a decentralized protocol for controlling the timing partition. Building on existing approaches [1, 2], this protocol allows various modules in the timing partition to "slip" with respect to each other in model time, improving simulation speed. Additionally, the protocol allows the user to resynchronize the system without the use of expensive global control logic. Thus we can imagine a user warming up the simulator in a fast slipping mode, then entering a slower synchronous mode to step through a critical code section.
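The abstract does not spell out the protocol, so the following is only one hypothetical way such slip could arise, not a description of HAsim's mechanism. It assumes modules connected by bounded FIFO links with a fixed model-time latency; each module keeps a local cycle counter and advances whenever its own links permit, with no global clock. The names `Port` and `Module` are invented for this sketch.

```python
from collections import deque


class Port:
    """Hypothetical FIFO link, pre-filled with `latency` empty messages."""

    def __init__(self, latency, capacity):
        self.queue = deque([None] * latency)
        self.capacity = capacity

    def can_send(self):
        return len(self.queue) < self.capacity

    def can_receive(self):
        return len(self.queue) > 0


class Module:
    """Advances its own model cycle using only its local ports."""

    def __init__(self, name, inputs, outputs):
        self.name, self.inputs, self.outputs = name, inputs, outputs
        self.cycle = 0

    def ready(self):
        return (all(p.can_receive() for p in self.inputs)
                and all(p.can_send() for p in self.outputs))

    def step(self):
        """Simulate one model cycle: consume one message per input port
        and produce one message per output port."""
        for p in self.inputs:
            p.queue.popleft()
        for p in self.outputs:
            p.queue.append((self.name, self.cycle))
        self.cycle += 1


link = Port(latency=1, capacity=4)
producer = Module("producer", inputs=[], outputs=[link])
consumer = Module("consumer", inputs=[link], outputs=[])

# Slipping: the producer runs ahead until the link is full.
while producer.ready():
    producer.step()
slip = producer.cycle - consumer.cycle

# Catching up: the consumer steps while the producer is paused.
while consumer.cycle < producer.cycle:
    consumer.step()
```

In this sketch the producer gets three model cycles ahead before back-pressure stops it, and the two modules meet again at cycle 3 once the consumer drains the link. The maximum slip is set by the link's capacity minus its latency, so each pair of neighbors bounds its own skew locally.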
Progress and Future Work
Currently, we have a parameterized functional partition with a simple MIPS ISA. The functional partition supports out-of-order issue, register renaming, and snapshots for fast rollback. The memory subsystem supports reordered loads and stores via a store buffer. The entire functional partition has been highly optimized for FPGA implementation and currently uses approximately 8000 slices and 25 Block RAMs, less than 55% of the resources available on a small FPGA such as the Xilinx Virtex 2 Pro 30. Work on implementing an x86-like micro-op ISA is ongoing.
A simple timing model for a basic five-stage pipeline is complete. We plan to explore how best to implement timing partitions for realistic superscalar and speculative architectures. Support for multicore systems is a critical component of future work. We also hope to use large FPGA platforms to run models that are otherwise difficult to simulate with current software techniques, e.g., multi-megabyte caches.
[1] Kenneth C. Barr, Ramon Matas-Navarro, Christopher Weaver, Toni Juan, and Joel Emer. Simulating a Chip Multiprocessor with a Symmetric Multiprocessor. In Proceedings of the 3rd Annual Boston Area Architecture Workshop, Providence, RI, 2005.
[2] Kenneth C. Barr, Heidi Pan, Michael Zhang, and Krste Asanovic. Accelerating a Multiprocessor Simulation with a Memory Timestamp Record. In International Symposium on Performance Analysis of Systems and Software, Austin, TX, 2005.
[3] Bluespec, Inc., Waltham, MA. Bluespec SystemVerilog Version 3.8 Reference Guide, November 2004.
[4] Derek Chiou, Huzefa Sunjeliwala, Dam Sunwoo, John Xu, and Nikhil Patil. FPGA-based Fast, Cycle-Accurate, Full-System Simulators. Number UTFAST-2006-01, Austin, TX, 2006.
[5] Joel Emer, Pritpal Ahuja, Eric Borch, Artur Klauser, Chi-Keung Luk, Srilatha Manne, Shubhendu S. Mukherjee, Harish Patil, Steven Wallace, Nathan Binkert, Roger Espasa, and Toni Juan. Asim: A Performance Model Framework. Computer, 35(2), pp. 68-72, 2002.