Hardware Synthesis of Guarded Atomic Actions with Performance Guarantees
Daniel L. Rosenband & Arvind

Introduction

Rule-based synthesis, for example Bluespec, has been quite successful in providing designers a methodology and a synthesis tool that eliminates functional bugs related to complicated race conditions in hardware designs [1,2]. The Bluespec synthesis tool has also been shown to generate circuits (RTL) that are comparable in quality, i.e. area and timing, to hand-coded Verilog [3]. In rule-based systems all legal behavior can be explained as a sequence of atomic actions on the state. However, to achieve good performance, much effort has gone into automatically generating implementations that fire many rules concurrently while maintaining the appearance of atomic and sequential execution. This leads to one of the problems with rule-based synthesis: performance, i.e. scheduling efficiency, is often unpredictable and at times does not satisfy the designer's requirements.

The goal of this research is to add performance specifications to rule-based synthesis. The designer writes rules as before, but can now also add performance constraints. The compiler then transforms the design into a functionally equivalent design that is guaranteed to satisfy the performance constraints (provided sufficient resources are available).

Performance guarantees in digital design are as important as correctness in the sense that if they are not met we have not implemented an acceptable design. Suppose we have a pipelined processor that executes programs correctly but whose pipeline stages cannot fire concurrently because of some ultraconservative interlocking scheme. We are unlikely to accept such a design. In the reorder buffer (ROB) of a modern 2-way superscalar processor, the designer may not consider the design task complete until the ROB can insert two instructions, dispatch two instructions, and write back the results from two functional units every cycle. Even simple micro-architectures (and not just processor-related ones) can present designers with such performance-related challenges [3]. It is important to understand that such requirements emanate from the designer of the micro-architecture rather than being something a high-level synthesis tool can automatically derive. To that extent, only the designer can provide such specifications, and performance specifications should be part of any hardware synthesis flow in which the tool makes scheduling decisions.

Such specifications are particularly important for rule-based synthesis because it relies heavily on a compiler to decide which parts of a design can execute concurrently. It assumes the design is specified as a collection of rules which the methodology requires to appear to execute atomically and sequentially when implemented. Using sophisticated scheduling algorithms, the compiler can often execute many of the rules within the same cycle while maintaining the appearance of sequential and atomic rule execution. However, as mentioned above, the compiler cannot always determine what the designer's intended schedule is and, due to some fundamental limitations of Bluespec's original scheduler, cannot achieve all possible schedules. Thus, performance specifications and new scheduling algorithms should make it easier to generate and experiment with alternate micro-architectures.
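To make the execution model concrete, the following minimal Python sketch (illustrative only; the Rule class, the step helper, and the produce/consume example are our own constructions, not part of Bluespec or the synthesis tool) models a design as a set of guarded atomic actions whose reference behavior is one enabled rule firing atomically per step:

    # Sketch of the reference semantics of a rule-based design:
    # in each step, one arbitrarily chosen enabled rule fires atomically.
    import random

    class Rule:
        def __init__(self, name, guard, action):
            self.name = name
            self.guard = guard    # state -> bool
            self.action = action  # state -> new state (applied atomically)

    def step(state, rules):
        """All legal behaviors are sequences of such atomic steps."""
        enabled = [r for r in rules if r.guard(state)]
        if not enabled:
            return state
        return random.choice(enabled).action(state)

    # Hypothetical two-stage pipeline: a producer and a consumer rule.
    rules = [
        Rule("produce",
             guard=lambda s: s["fifo"] is None,
             action=lambda s: {**s, "fifo": s["counter"],
                               "counter": s["counter"] + 1}),
        Rule("consume",
             guard=lambda s: s["fifo"] is not None,
             action=lambda s: {**s, "out": s["fifo"], "fifo": None}),
    ]

    state = {"counter": 0, "fifo": None, "out": None}
    for _ in range(6):
        state = step(state, rules)
    print(state)

The scheduling problem addressed below is to fire many such rules within the same clock cycle while preserving the appearance of this one-rule-at-a-time semantics.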
Approach

We take the following approach to specifying performance constraints: the user simply lists which sets of rules should execute simultaneously within each cycle, and in what order they should appear to execute. For example, given a standard 5-stage processor pipeline, the designer would specify the following schedule, where "<" means that the left-hand side and the right-hand side must execute within the same cycle (provided both rules' guards are true) and that it should appear as though the left-hand side executed first:

    Writeback Rule < Memory Rules < Execute Rules < Decode Rules < Fetch Rule

We then use a fundamental property of rule-based descriptions to generate a design that satisfies these constraints: if a new rule being added to a design is a so-called derived rule, then it does not add any new behaviors, and hence it is always safe to add derived rules to a design. One such derived rule is the composite rule. The composite rule Rab of two rules Ra and Rb is defined such that Rab must exhibit the behavior of Ra followed by Rb. Thus, our goal is to add to the design composite rules corresponding to all possible subsets of rules that the user-prescribed schedule requires to execute simultaneously when they are enabled. Clearly, generating a new composite rule for each subset of executable rules is impractical for large designs since the number of rules in the design would grow exponentially [5]. Instead, we have developed an algorithm that transforms each rule individually such that it behaves as part of a composite rule when enabled at the same time as other rules. The basis of this algorithm is a new state element, the Ephemeral History Register (EHR) [4], together with a renaming scheme for all methods and state elements that a rule calls. The basic idea is that we can remember what values one rule writes and ensure that the rule that follows it in the composition observes those values. After replacing all registers with EHRs and applying the renaming algorithm, we obtain a design that is guaranteed to satisfy the performance requirements and to be functionally equivalent to the original design.
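The following simplified Python sketch illustrates the forwarding behavior that makes this composition work. It is a software model of the EHR semantics described in [4] under our own assumptions; the class and method names are ours, and the real EHR is a hardware primitive with per-port read and write methods rather than a Python object:

    # Simplified model of an Ephemeral History Register (EHR), assuming the
    # semantics of [4]: a read through port i observes writes made through
    # lower-numbered ports in the same cycle, so a rule placed later in a
    # composite rule sees the values written by earlier rules.
    class EHR:
        def __init__(self, init):
            self.value = init    # value committed in previous cycles
            self.writes = {}     # port index -> value written this cycle

        def read(self, port):
            # Most recent write by a lower-numbered port, else committed value.
            earlier = [p for p in self.writes if p < port]
            return self.writes[max(earlier)] if earlier else self.value

        def write(self, port, v):
            self.writes[port] = v

        def commit(self):
            # End of cycle: the highest-numbered write becomes the new value.
            if self.writes:
                self.value = self.writes[max(self.writes)]
            self.writes = {}

    # Two rules scheduled as "Ra < Rb" within one cycle: Rb (port 1) observes
    # Ra's write (port 0), matching the behavior of Ra followed by Rb.
    r = EHR(10)
    r.write(0, r.read(0) + 1)   # Ra: increments the register (10 -> 11)
    r.write(1, r.read(1) * 2)   # Rb: sees Ra's result, writes 22
    r.commit()
    assert r.value == 22

Assigning port numbers according to the rules' positions in the user-specified schedule is what the renaming scheme arranges, so that within a cycle each rule observes the effects of the rules that precede it in that order.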
Progress and Future

We have tested the synthesis algorithms on simple circuits and have shown that the generated circuits are efficient and that they satisfy the performance guarantees. By changing only the scheduling constraints we can perform interesting design transformations and evaluate their impact on both cycle time and cycle count. This simplifies architectural exploration since the designer can obtain rapid feedback on these critical design criteria. For example, we can transform a processor into a superscalar design [5] by changing the above schedule to:

    Wb < Wb < Mem < Mem < Exe < Exe < Dec < Dec < Fetch < Fetch

In the future we will apply these transformations to larger designs and increase the power of the performance specification language.

References

[1] Arvind and X. Shen. Using term rewriting systems to design and verify processors. IEEE Micro, 19(3), pp. 36-46, 1999.
[2] J. C. Hoe and Arvind. Operation-centric hardware description and synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23(9), pp. 1277-1288, 2004.
[3] Arvind, R. S. Nikhil, D. L. Rosenband, and N. Dave. High-level Synthesis: An Essential Ingredient for Designing Complex ASICs. In ICCAD, San Jose, CA, October 2004.
[4] D. L. Rosenband. The Ephemeral History Register: Flexible Scheduling for Rule-Based Designs. In MEMOCODE, San Diego, CA, June 2004.
[5] M. Lis. Superscalar Processors via Automatic Microarchitecture Transformations. M.Eng. Thesis, Massachusetts Institute of Technology, Cambridge, MA, June 2000.