Research Abstracts - 2007

Scalable And Deadlock Free Cache Coherence For CMPs

Edya Ladan-Mozes & Charles E. Leiserson

Introduction

Chip multiprocessors (CMPs) have recently been introduced to meet the growing demand for computing capacity and better performance. CMPs allow several tasks to be performed simultaneously on different computation units, thereby providing true parallelism that improves performance. In most current CMP machines, the cache architecture is based on a private L1 cache for each core and a shared L2 cache. The way in which the L2 is shared differs between manufacturers, but most of them use a shared bus to facilitate the communication between the cores [1][2][3]. The bus also serves as a serialization point. Thus, as the number of cores increases, contention on the bus increases and performance deteriorates. The cache-coherency protocol used in these systems relies on the serialization provided by the bus and on the ability of all caches and the memory to listen to the messages passed on it (snooping-based coherence).

This project's goal is to investigate a new, multilevel, modular, shared-cache architecture for CMPs, as well as a supporting asynchronous cache-coherence protocol. The new architecture combines a routing protocol with a cache-coherence protocol that distributes and reduces the communication between the processors' private caches and the main memory. To do so, we use levels of shared caches, where a cache in one level is shared by one or more caches in the next level. A line in a shared cache contains state information as well as the actual content of the line. The caches communicate by sending messages. We believe that this architecture can scale better than bus-based coherency protocols, provide efficient data sharing among CPUs working on the same task, and reduce the number of accesses to main memory.
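To make these building blocks concrete, the following C++ sketch is our own illustrative model, not the project's actual design; all type names, field names, and the 64-byte line size are assumptions. It shows a shared-cache line carrying both state information and the line's content, together with the kind of message that parent and child caches could exchange.

    #include <array>
    #include <cstddef>
    #include <cstdint>

    // MSI-style coherence state kept with every line in a shared cache.
    enum class LineState { Invalid, Shared, Modified };

    constexpr std::size_t kLineBytes = 64;  // assumed cache-line size

    // A line in a shared cache: state information plus the actual content.
    struct CacheLine {
        std::uint64_t tag = 0;
        LineState state = LineState::Invalid;
        std::array<std::uint8_t, kLineBytes> data{};
    };

    // Caches communicate only by sending messages between parent and children.
    enum class MsgType { ReadReq, WriteReq, Invalidate, Data, Ack };

    struct Message {
        MsgType type;
        std::uint64_t address;
        std::array<std::uint8_t, kLineBytes> payload{};
    };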

The Cache Architecture and Coherency Protocol

Although we intend the architecture to use a robust routing network, such as a fat-tree or butterfly-based network, our initial design places shared caches as nodes in a binary tree. In this structure, the root is the main memory and the leaves are the private caches attached to each processor. Each intermediate node in the tree represents a cache that is shared between its children. The shared caches are assumed to be large enough to accommodate all lines needed by their children (that is, the size of each shared cache is at least the sum of the sizes of its children). In this binary-tree structure, each shared cache can communicate only with its two children and its parent. Each private cache (leaf) can communicate only with its parent.
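A minimal C++ sketch of this tree structure, under the assumptions stated above (the names are ours; the actual design may differ): main memory is the root, private caches are the leaves, and each shared cache is built over two children whose combined capacity it must be able to hold.

    #include <cassert>
    #include <cstddef>
    #include <memory>

    struct CacheNode {
        std::size_t capacityLines;        // how many lines this cache can hold
        CacheNode* parent = nullptr;      // null for the root (main memory)
        std::unique_ptr<CacheNode> left;  // null for leaves (private caches)
        std::unique_ptr<CacheNode> right;

        explicit CacheNode(std::size_t cap) : capacityLines(cap) {}
        bool isLeaf() const { return !left && !right; }
    };

    // Build a shared cache over two children, enforcing the assumption that
    // its size is at least the sum of its children's sizes.
    std::unique_ptr<CacheNode> makeShared(std::unique_ptr<CacheNode> l,
                                          std::unique_ptr<CacheNode> r,
                                          std::size_t capacityLines) {
        assert(capacityLines >= l->capacityLines + r->capacityLines);
        auto node = std::make_unique<CacheNode>(capacityLines);
        l->parent = node.get();
        r->parent = node.get();
        node->left = std::move(l);
        node->right = std::move(r);
        return node;
    }

For example, four private caches of 512 lines each would sit below two shared caches of at least 1024 lines each, which in turn sit below the root representing main memory; every node exchanges messages only with its parent and children.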

The cache-coherence protocol we devised is based on the MSI protocol [4]. Each cache line in each shared cache can be in one of the MSI states, and additional per-line information records what this cache knows about the states of its neighbors in the binary tree. Caches communicate by sending messages between parent and children. Our protocol is highly asynchronous, and each cache can handle any combination of messages (including conflicting requests). The only restriction imposed by the protocol's design is that a cache does not send a second message to a neighbor regarding the same line before it receives a response to its first message. This asynchrony does not exist in shared-bus architectures, where the bus serves as a communication channel as well as a serialization point. In a bus architecture, only one message is broadcast at a time, and all the caches on the bus listen to this message. The proposed architecture contains no such serialization point, so there is no total order on the requests and replies of all caches and no global knowledge of the state of the cache lines. The protocol we devised handles this asynchrony and provides coherence despite the lack of global knowledge. It can easily be extended to support the Exclusive state as well (as in the MESI protocol).
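The per-line bookkeeping this implies might look like the following sketch (again an illustration under our own naming assumptions, not the protocol itself): each shared cache records its own MSI state for a line, its possibly stale view of each neighbor's state, and a per-neighbor flag that enforces the single-outstanding-message restriction.

    #include <array>

    enum class MsiState { Invalid, Shared, Modified };

    // In the binary tree, a shared cache has at most three neighbors.
    enum Neighbor { Parent = 0, LeftChild = 1, RightChild = 2 };

    struct LineEntry {
        MsiState localState = MsiState::Invalid;

        // This cache's (possibly outdated) knowledge of each neighbor's state.
        std::array<MsiState, 3> neighborState{
            MsiState::Invalid, MsiState::Invalid, MsiState::Invalid};

        // True while a message about this line sent to that neighbor is
        // still awaiting a response.
        std::array<bool, 3> awaitingResponse{false, false, false};

        // The protocol's only restriction: do not send a second message about
        // this line to a neighbor before the first message is answered.
        bool maySend(Neighbor n) const { return !awaitingResponse[n]; }
        void recordSend(Neighbor n)     { awaitingResponse[n] = true; }
        void recordResponse(Neighbor n) { awaitingResponse[n] = false; }
    };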

Recently, a similar approach was taken by Eisley et al. [6]. They propose a coherence implementation in which directories are embedded within the routing nodes. Their protocol dynamically builds a tree of sharers for a line and forwards requests to nearby copies. In their design, however, all write requests and conflicting requests must be forwarded to the home node. They resolve the deadlock that can arise when a line is invalidated by using timeouts. In contrast, our protocol is deadlock-free without the need for timeouts, and conflicts are less likely to occur. Several studies of interconnects for CMPs [9][10][12] suggest ways to reduce coherence traffic and propose interconnects other than a bus. Many works have been published on cache coherence and multilevel caching; only a few are mentioned here [7][8][11]. The scheme of [7] is bus-based and less modular, while [11] limits the number of pointers kept per line in its distributed directory and uses a recovery method when more pointers are needed.

Our new cache-coherence protocol is now being implemented in Bluespec [5], a hardware-design tool that simplifies the expression of designs and models using rules and interface methods. This implementation will help us gain a better understanding of the advantages and disadvantages of such an architecture and compare it with other models.

Future Work

This research is still in its early stages. We plan to explore the proposed architecture from both theoretical and practical (feasibility) perspectives. We also plan to extend the design to richer interconnect structures and to consider a physical chip design.

Research Support:

This research is supported in part by NSF Grant CNS-0615215 and in part by a gift from Intel Corporation.

References:

[1] OpenSPARC T1 Microarchitecture Specification: http://opensparc-t1.sunsource.net/specs/OpenSPARCT1_Micro_Arch.pdf.

[2] It Takes An Architecture To Make A Multi-core Processor: http://multicore.amd.com/Resources/MultiCoreArchitecturePaper.pdf.

[3] Power5 System Microarchitecture: http://www.research.ibm.com/journal/rd/494/sinharoy.html.

[4] David A. Patterson and John L. Hennessy. Computer Organization and Design (2nd ed.): The Hardware/Software Interface. Morgan Kaufmann, San Francisco, CA, USA, 1998.

[5] Bluespec SystemVerilog: http://www.bluespec.com/.

[6] N. Eisley, L. Peh, and L. Shang. In-Network Cache Coherence. In MICRO '06: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 321-332, IEEE Computer Society, 2006.

[7] D. R. Cheriton, H. A. Goosen, and P. D. Boyle. Multi-level shared caching techniques for scalability in VMP-M/C. In ISCA '89: Proceedings of the 16th Annual International Symposium on Computer Architecture, pages 16-24, ACM Press, 1989.

[8] J. Chang and G. S. Sohi. Cooperative Caching for Chip Multiprocessors. In ISCA '06: Proceedings of the 33rd Annual International Symposium on Computer Architecture, pages 264-276, IEEE Computer Society, 2006.

[9] R. Kumar, V. Zyuban, and D. M. Tullsen. Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling. In ISCA '05: Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 408-419, IEEE Computer Society, 2005.

[10] L. Cheng, N. Muralimanohar, K. Ramani, R. Balasubramonian, and J. B. Carter. Interconnect-Aware Coherence Protocols for Chip Multiprocessors. In ISCA '06: Proceedings of the 33rd Annual International Symposium on Computer Architecture, pages 339-351, IEEE Computer Society, 2006.

[11] D. Chaiken, J. Kubiatowicz, and A. Agarwal. LimitLESS Directories: A Scalable Cache Coherence Scheme. In ASPLOS-IV: Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 224-234, ACM Press, 1991.

[12] W. J. Dally and B. Towles. Route Packets, Not Wires: On-Chip Interconnection Networks. In Proceedings of the Design Automation Conference, pages 684-689, 2001.

 
