
Research
Abstracts  2006 
A Segemented ParallelPrefix VLSI Circuit with Small Delays for Small SegmentsBradley C. KuszmaulSummaryWe developed a VLSI circuit for segmented parallel prefix with gate delay O(log S) and wire delay O(√S) for segment size S, and total area O(N). Thus, for example, for the problem of adding random numbers, in most cases, the addition would complete with only O(log log N) gate delay and O(√log N) wire delay. BackgroundOne of the limitations to increasing clock speed for microprocessors is the time required for a VLSI circuit to perform an addition of two integers [2]. Fastcarrylookahead circuits implement addition of Nbit numbers in order O(log N) gate delays, and, with the proper VLSI layout, in area O(N) and with O(√N) wire delay, which is tight for the general problem of computing a value across an entire chip. Shown below is example layout of a circuit that achieves only O(log S) gate delays and O(√S) wire delay, where S is the length of the longest carry chain, however. The basic idea is that for fastcarry lookahead, the values from some distant parts of the chips can be ignored whenever the carry is ``killed''. The same idea works for any segmented parallel prefix circuit. To achieve these bounds both the gate delays and the wire delays must be kept small. To reduce the gate delays, insert ``shortcuts'' (shown as dashed lines in the figure) between distantcousin subtrees that are adjacent in the carry chain. To reduce the wire delays, lay out subtrees in a modified Htree layout (reflected by the node numbering in the figure) so that the Euclidean distance between leaves i and j is O(√ji). See [5] for more details. Figure 1: A 64node parallel prefix tree with small delays for small segments. The leaves are labeled L_{0} through L_{63}. Shortcut wires are shown as dashed arrows. Related workCheng et al. [3] describe carrylookahead adder circuitry with O(log N) worst case gate delay using selftimed circuitry, and O(log log N) average case gate delay for addition of random numbers. In contrast, this result addresses gate delay, wire delay, and chip area, and also shows how to achieve corresponding results for any segmented parallel prefix circuit. Greenberg [4] shows how to implement fattree routing network with shortcuts that are similar to the shortcuts used here. An analogous performance property for search trees is the ``finger'' property described by Sleator and Tarjan [8]. In the past, parallel prefix circuits have been important for parallel computers (see, e.g., [7]), but we came across this problem in the context of scaling superscalar processors, where it turns out almost everything done by a superscalar processor can be done asymptotically optimally using segmented parallel prefix circuits [6]. This work appeared in [1]. Research SupportThis research is supported in part by the SingaporeMIT Alliance, NSF CAREER Grant 9702980, and NSF Grants ACI0324974, and CNS0305606. References:[1] Bradley C. Kuszmaul, A Segmented ParallelPrefix VLSI Circuit with Small Delays for Small Segments. In The 17th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2005), p. 213, Las Vegas, Nevada. July 2005. (Brief announcement.) [2] D. Brooks and M. Martonosi. Dynamically exploiting narrow width operands to improve processor power and performance. In HPCA'99, pages 1322, January 1999. [3] F.C. Cheng, S. H. Unger, and M. Theobald. Selftimed carrylookahead adders. IEEE Trans. Comput., 49(7):659672, July 2000. [4] R. I. Greenberg. The fatpyramid and universal parallel computation independent of wire delay. IEEE Trans. Comput., 43(12):13581364, 1994. [5] Bradley. C. Kuszmaul. A segmented parallelprefix vlsi circuit with small delays for small segments (full version). May 2005. [6] Bradley C. Kuszmaul, Dana S. Henry, and Gabriel H. Loh. A comparison of asymptotically scalable superscalar processors. Theory of Comput. Syst., 35:129150, 2002. [7] Charles E. Leiserson, Zahi S. Abuhamdeh, David C. Douglas, Carl R. Feynman, Mahesh N. Ganmukhi, Jeffrey V. Hill, W. Daniel Hillis, Bradley C. Kuszmaul, Margaret A. St. Pierre, David S. Wells, Monica C. Wong, ShawWen Yang, and Robert Zak. The network architecture of the Connection Machine CM5. J. Parallel and Dist. Comp., 33(2):145158, 1996. [8] D. D. Sleator and R. E. Tarjan. Selfadjusting binary search trees. J. ACM, 32(3):652686, July 1985. 

