Automatic Synthesis of Cache-Coherence Protocol Processors Using BluespecMan Cheuk Ng, Nirav Dave & ArvindIntroductionIt has been shown previously that cache-coherence protocols can be described precisely and naturally using TRSs [1]. Such descriptions allow separating the protocol into mandatory and voluntary rules which in turn has been shown to reduce the verification effort. Unfortunately, when it is time to implement these verified cache coherence protocols in hardware, designers are often unable to translate the high-level descriptions directly into hardware, either due to inappropriateness of the protocol mechanisms for hardware, or other real-world constraints (e.g. timing, area). Any modification of the protocol during the implementation stage essentially opens up the Pandora's box of verification. Ordinary verification techniques do not work well with cache coherence protocol implementations because of their time-dependent, nondeterministic behaviors. In fact, the task of proving that the implementation of a protocol is correct is as difficult as the task of verifying the correctness of the cache coherence protocol in the first place. In this project, we will show that the TRS descriptions of cache-coherence protocols can be expressed in Bluespec SystemVerilog (BSV), a hardware description language based on guarded atomic actions. Further, we will show that, by using BSV compiler, we can automatically synthesize a hardware implementation. Such high-level synthesis will dramatically raise our ability to experiment with protocols and processor-memory interfaces in multiprocessor systems. ApproachIt has been shown previously that it is straightforward to translate TRS rules into BSV [2, 3]. Unfortunately, we find that a direct translation does not automatically guarantee an efficient and correct implementation of a cache-coherence protocol processor and requires the following modifications:
ProgressWe have implemented a non-blocking MSI cache coherence protocol for a DSM system (Figure 1). The TRS description of the protocol was presented in the Computer System Architecture course in Fall 2004 at MIT. After we have translated the TRS rules into BSV rules, we made the necessary modifications to ensure the correctness by following the methodology discussed in the previous section: First, we divided our design into two kind of modules: Cache-side Protocol Processor (CPP) modules and Memory-side Protocol Processor (MPP) modules. Each kind is responsible for protocol traffic at cache-side and memory-side respectively. Second, for buffer management, the protocol classifies the request messages from the caches to the main memory as low priority messages and the reply messages as high priority messages. The protocol requires that a low priority message can never block a high priority message indefinitely. We achieves this by having separate buffer queues for each type of messages. On the other hand, there is no restriction on the sizes of all the buffers existed in the protocol since size variations only affect the performance but not the correctness for this protocol. Finally, for the fairness issue, we added an arbitrator to the MPP so that the low priority messages from each CPP are guaranteed to be processed by the MPP eventually. The final design is parameterized by the number of CPPs, the number of MPPs, the size of memory in each of these units, and the sizes of various queues. The design can also be modified with some effort so that it can accept various cache organizations (e.g., direct mapped, set associative) as a parameter. It only took 1050 lines of BSV code to represent the processor for this protocol. Once the protocol was defined in the tabular form, the design was completed in 3 man-days by two of the authors. Both the designers had expert level understanding of BSV and the MSI protocol presented. Nearly half of this time was spent finding and correcting typographical errors in the translation. No other functional errors were found or introduced in this encoding phase. The rest of the time was spent in tuning for performance and setting up the test bench. The final design was tested and synthesized for both 4 and 8 instantiated caches. Compilation of design remained roughly constant, taking 333 seconds for a 4 cache system , and 339 seconds for an 8 cache one. For testing, A directed random trace of cache requests was fed to each CPP. The testbench sustained an average latency of 1.3 cycles per lookup with a single-cycle latency cache, and a 10 cycle latency main memory with miss rate at roughly 5%. FutureWe plan to modify out work so that designs generated by our tools will be compatible with the UNUM framework [4], which models various PowerPC microarchitectures for multiprocessor configurations. By doing so, we will be able to generate a series of modular memory systems, each with its own coherence protocol but with identical interfaces to the processor modules. Our setup would provide a much more realistic setting for both protocol verification and performance measurements than traditional simulators. Additionally, we plan to map these BSV designs onto large FPGAs. Research SupportFunding for this work has been provided by the IBM agreement number W0133890 as a part of DARPA's PERCS Projects. References[1] J. E. Stoy, X. Shen and Arvind. Proofs of Correctness of Cache-Coherence Protocols. In The Proceedings of FME'01: Formal Methods for Increasing Software Productivity, pp. 47-71, London, UK, 2001. [2] N. Dave. Designing a Processor in Bluespec. Master's Thesis, MIT, Cambridge, MA, January 2005. [3] Bluespec, Inc. Bluespec SystemVerilog Version 3.8 Reference Guide. Waltham, MA, November 2004. [4] N. Dave, M. Pellauer and Arvind. UNUM: A General Microprocessor Framework Using Guarded Atomic Actions. CSAIL Research Abstracts 2005, MIT, Cambridge, MA, April 2005. |
||
|