MIT CSAIL Research Abstracts

CSAIL Publications and Digital Archive header

Research Abstracts Home

CSAIL Digital Archive

Research Activities

CSAIL Home

horizontal line

Research Abstracts - 2007
horizontal line

horizontal line

Cilk Provides the ``Best Overall Productivity'' for High Performance Computing (and Won the HPC Challenge Award to Prove It)

Bradley C. Kuszmaul

Summary

My entry won award for "Best Overall Productivity" in the 2006 HPC Challenge Class 2 (productivity) competition. I used the Cilk multithreaded programming language \cite{Cilk} to implement all six of the benchmarks, including LU decomposition with partial pivoting, matrix multiplication, vector add, matrix transpose, updates of random locations in a large table, and a huge 1-dimensional FFT. I measured the performance on the NASA's "Columbia" SGI Altix system. The programs achieved good performance (e.g., up to 943Flops on 256 processors for matrix multiplication). I added a total of only 137 keywords to transform the six C programs into Cilk programs.

Introduction

The HPC Challenge competition [6] is based on a suite of seven MPI programs called the the HPC Challenge benchmark [5]. The seven programs include HPL (LU Decomposition with partial pivoting), RandomAccess (update random locations in a huge table), STREAM (vector scale and add), FFT (a single huge 1-dimensional FFT), DGEMM (a single huge matrix multiplication), and PTRANS (a single huge matrix transposition). The competition focuses on the first four benchmarks.

The HPC Challenge recognizes two classes of entries. The Class~1 (performance) awards are made to teams that achieve the highest performance running an unmodified reference program written in MPI. The Class~2 (productivity) competition allows submissions to be written in any programming language, and the award is weighted 50% on performance and 50% on code elegance, clarity, and size. The Class~2 competition required entrants to submit at least 3 of the first four HPC programs. I submitted all six.

Although Cilk runs on a wide variety of platforms including x86, powerPC, Itanium, Sparc, and MIPS processors, I ran my programs on the largest SMP available in order to maximize my ``performance'' score. I benchmarked my programs on the NASA Columbia system, which consists of many Altix 3700 machines with 1.5GHz Itanium 2 processors, and 2GiBytes of memory per processor. I used shared memory configurations of up to 256 processors.

The following table shows the code size and performance for each benchmark. The code size is expressed as lines of code, including comments, C kernels and header files .h files), but not including the Intel MKL or FFTW libraries. Simply counting the lines of code does not demonstrate Cilk's productivity advantages, however. A better measure of productivity is the number of keywords that must be removed to turn the Cilk program back into a C program. This number, called the ``distance-to-desktop'' is also shown.

Cilk Code Size
	HPL	DGEMM	Stream	PTRANS	RandomAccess	FFTE
Cilk lines of code	348	97	58	81	123	230
Distance to Desktop	41	19	11	13	18	35
MPI lines of code	15,608	??	658	2,261	1,883	1,747

Performance
CPUS	HPL	DGEMM	Stream	PTRANS	RandomAccess	FFTE
CPUS	(GFLOPS)	(GFLOPS)	(GB/s)	(GB/s)	(GUPS)	(GFLOPS)
1	5.2	5.1	0.8	0.7	0.00503	0.7
2	9.4	9.7	0.9	0.5	0.00637	0.9
4	17.3	19.7	1.8	0.9	0.01022	1.8
8	30.8	35.7	2.9	1.7	0.01625	2.9
16	52.5	64.9	4.0	3.3	0.01625	4.0
32	88.6	118.9	6.8	6.1	0.01521	6.8
64	101.6	248.0	14.0	11.6	0.02269	14.0
128		463.1	25.0	18.3	0.01080	25.9
256		943.0	44.2	27.2	0.00996	49.5
384		1195.9	54.1
MPI-32	129.176		0.869	2.558	0.00364	4.080
MPI-128	638.902		2.158	7.523	0.01124	4.080

Configuration

The configuration was as follows. The Altix 3700 contained 2 Gigabytes per processor. All benchmarks are ``global'' versions solving one large problem that fits into just over 1/4th of memory. The benchmarks are listed in the order described on the HPCC web page. For a few of the larger runs, I was unable to obtain machine time before the deadline. Also shown are the code size for MPI as well as the the best MPI-based numbers from the HPCC results page for SGI Altix 3700 with 32 and 128 processors.

I compiled the Cilk programs using Cilk 5.4.2.4 and gcc 3.3.3, using the ``-O2'' optimization flag. I some cases I compiled the kernels of the code using the Intel C compiler (icc), version 9.1, using the ``-O3'' optimization flag.

The problem sizes where chosen as specified in the HPC Challenge benchmark. Typically this meant that problem size was chosen to be at least 1/4 of physical memory (that is, 0.5GiB per processor.)

Cilk Advantages and Disadvantages

The advantages of Cilk, as demonstrated by this entry are:

Cilk is simple. Very little of the code is concerned with parallelism. Most of the code complexity is focused on making good use of cache.
Cilk code is small.
Cilk programs can use the best serial code for the bulk of the computational work. For example, I used the Intel MKL, FFTW, and various kernels compiled with the Intel C Compiler.
Cilk programs can express cache-efficient code, giving good performance on cost-effective machines.
Cilk is efficient. On 32 processors, the Cilk code outperforms the MPI code for Stream, PTRANS, and RandomAccess (but Cilk falls behind on HPL). On 128 processors Cilk outperforms MPI for Stream and PTRANS, and performs comparably for RandomAccess (and again is behind for HPL.)
Cilk scales down as well as up. Cilk achieves good performance on 1 processor.
Cilk is portable across a wide variety of shared-memory machines.

I wrote most of these code over just a few days. Even the FFTE code, which is relatively complex, came together in only 2 days. The HPL code took longer: It took me a week to build a C version of the code, and it took Keith Randall a few more days to Cilkify the code. This is not too bad even compared to developing serial code, and is certainly more productive than MPI coding. Cilk enabled me to build all-new implementations of FFTE and HPL in only a few weeks of elapsed time.

Cilk's main disadvantage is that it requires a shared-memory machine. This shows up as two costs: (1) today's shared-memory machines are more expensive than today's clusters. (2) The operating system in a shared-memory machine can introduce serial bottlenecks (as shown by RandomAccess.) I believe that the technology trends favor Cilk in both of these areas. (1) Multicore processing technology will mean that in the future, every machine will be a shared memory machine at the node, and it seems likely that hardware makers will provide ways to build shared memory clusters as cheaply as distributed-memory clusters. (2) The operating systems will improve as shared-memory machines become more widespread. In the future, the costs will drop for shared memory.

The Benchmarks

Here I remark on some interesting points about each benchmark. The code itself is available at http://bradley.csail.mit.edu/~bradley/hpcc06/

HPL: Our code for performing LU decomposition with pivoting is based on an algorithm developed by Sivan Toledo [8]. The Toledo algorithm is numerically stable, since its algorithm is simply a rescheduling of the same data flow graph from the naive LU decomposition. The algorithm makes efficient use of cache, and exhibits enough parallelism to keep processors busy. n The code we used for the final submission exhibits good speedup up to about 32 processors. Cilk's critical-path-length instrumentation indicates that there is plenty of parallelism in the code, suggesting that the Altix hardware is limiting the performance.
DGEMM: The matrix-multiplication code employs a divide-and-conquer approach, and is very similar to DGEMM code that appears in our HPL code. The base case of the recursion is an m*k*n DGEMM where m+k+n<512. My code employs the Intel MKL to implement the base case of the recursion. Cilk DGEMM achieves speedup of 188 on 256 processors.
Stream: Stream achieves poor speedup when transitioning from 1 to 2 processors, probably because the two processors share a memory bus. Stream then exhibits good speedups up to nearly the largest machine size.
PTRANS: The trick for achieving good performance on PTRANS is to avoid cache associativity conflicts, especially if the array is a power of two in size. To do so, I stop the divide-and-conquer recursion near at about the size that fits in cache, and copy the data to an array with short stride, transpose it, and copy it back. Thus my PTRANS code is cache-aware, unlike the other codes a which are cache oblivious.
RandomAccess: The performance of my randomaccess implementation does not scale with the number of processors. The randomaccess code achieves a maximum performance of 0.022 GUPS on 64 processors, and then gets worse for larger numbers of processors. I suspect that the performance is limited by the rate at which the operating system can handle TLB misses. One possible explanation is that the operating system serializes TLB misses. Under this theory, the Linux 2.6.5 ia64 kernel can only process only one TLB miss at a time, and it takes about 50$\mu$s to handle a TLB miss. If true, then for large numbers of processors, nearly half the performance of the STREAM benchmark goes into handling serialized TLB misses as well. This performance problem seems to be an operating system issue, not a hardware or programming language issue.
FFTE: My FFTE code employs FFTW [3], which provides a Cilk interface to achieve parallelism when performing an many FFTs on an array. One problem I faced for FFTE is that the FFTW program is not optimized for large one-dimensional FFT problems. So I implemented the 4-step solution: Compute FFTs on the rows. Transpose and multiply by the twiddle factors. Compute FFTs on the columns. Since serial FFT is most efficient on arrays which are a power of two, it was especially important to get the transpose right, using the technique mentioned above for PTRANS. Otherwise, cache associativity misses would hurt the performance.

Acknowledgments

Keith Randall contributed substantially to the LU-decomposition implementation. Chris Hill at MIT's department of Earth and Planetary Sciences helped run the benchmarks on the NASA Columbia machine. I used FFTW and the Intel MKL for the serial parts of certain benchmarks.

Research Support

This research was supported in part by MIT Lincoln Laboratories, and NSF Grants ACI-0324974, CNS-0615215, CCF-0541209, CCF-0621511, and CNS-0540248.

References:

[1] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An Efficient Multithreaded Runtime System. Journal of Parallel and Distributed Computing, 37(1):55-69, August 25 1996.

[2] The Cilk Project..

[3] Mateo Frigo and Steven G. Johnson. FFTW: An adaptive software architecture for the FFT. In ICASSP Conference Proceedings, volume 3, pp. 1381--1384, 1998.

[4] Matteo Frigo, Charles E. Leiserson and Keith H. Randall. The Implementation of the Cilk-5 Multithreaded Language. Proceedings of the SIGPLAN '98 Conference on Programming Language Design and Implementation, pp. 212-223, Montreal, Quebec, Canada.

[5] HPC Challenge Benchmark. 2006.

[6] HPC Challenge Award Competition. 2006.

[7] Bradley C. Kuszmaul. Cilk Provides the ``Best Overall Productivity'' for High Performance Computing (and Won the HPC Challenge Award to Prove It) In The Symposium on Parallelism in Architectures and Algorithms, San Diego, CA 2007. To appear.

[8] Sivan Toledo. Locality of reference in {LU} decomposition with partial pivoting. SIAM J. Matrix Analysis and Applications, 18(4):1065--1081, October 1997.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu