Research Abstracts - 2007
Cilk Provides the ``Best Overall Productivity'' for High Performance Computing (and Won the HPC Challenge Award to Prove It)

Bradley C. Kuszmaul

Summary

My entry won the award for ``Best Overall Productivity'' in the 2006 HPC Challenge Class 2 (productivity) competition. I used the Cilk multithreaded programming language [1] to implement all six of the benchmarks: LU decomposition with partial pivoting, matrix multiplication, vector add, matrix transpose, updates of random locations in a large table, and a huge 1-dimensional FFT. I measured the performance on NASA's ``Columbia'' SGI Altix system. The programs achieved good performance (e.g., up to 943 GFlop/s on 256 processors for matrix multiplication). I added a total of only 137 keywords to transform the six C programs into Cilk programs.

Introduction

The HPC Challenge competition [6] is based on a suite of seven MPI programs called the HPC Challenge benchmark [5]. The programs include HPL (LU decomposition with partial pivoting), RandomAccess (updates of random locations in a huge table), STREAM (vector scale and add), FFT (a single huge 1-dimensional FFT), DGEMM (a single huge matrix multiplication), and PTRANS (a single huge matrix transposition). The competition focuses on the first four benchmarks.

The HPC Challenge recognizes two classes of entries. The Class 1 (performance) awards go to teams that achieve the highest performance running an unmodified reference program written in MPI. The Class 2 (productivity) competition allows submissions written in any programming language; the award is weighted 50% on performance and 50% on code elegance, clarity, and size. The Class 2 competition required entrants to submit at least three of the first four HPC programs. I submitted all six.

Although Cilk runs on a wide variety of platforms, including x86, PowerPC, Itanium, SPARC, and MIPS processors, I ran my programs on the largest SMP available in order to maximize my ``performance'' score. I benchmarked my programs on the NASA Columbia system, which consists of many Altix 3700 machines with 1.5 GHz Itanium 2 processors and 2 GiB of memory per processor. I used shared-memory configurations of up to 256 processors.

The following table shows the code size and performance for each benchmark. The code size is expressed as lines of code, including comments, C kernels, and header (.h) files, but not including the Intel MKL or FFTW libraries. Simply counting the lines of code does not demonstrate Cilk's productivity advantages, however. A better measure of productivity is the number of keywords that must be removed to turn the Cilk program back into a C program. This number, called the ``distance-to-desktop,'' is also shown.
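To make the ``distance-to-desktop'' measure concrete, here is a minimal illustrative sketch (the classic recursive Fibonacci function from the Cilk literature, not one of the benchmark kernels). Its distance-to-desktop is four keywords (one cilk, two spawns, and one sync); deleting them leaves a valid serial C program with the same meaning.

    /* Illustrative sketch, not part of the HPCC entry.
       Distance-to-desktop = 4: the keywords cilk, spawn (twice),
       and sync. Removing them yields the serial C elision. */
    cilk int fib(int n)
    {
        int x, y;
        if (n < 2) return n;
        x = spawn fib(n - 1);  /* may run in parallel with the next spawn */
        y = spawn fib(n - 2);
        sync;                  /* wait for both spawned calls to finish */
        return x + y;
    }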
Configuration

The configuration was as follows. The Altix 3700 contained 2 GiB of memory per processor. All benchmarks are ``global'' versions solving one large problem that occupies just over 1/4 of memory. The benchmarks are listed in the order described on the HPCC web page. For a few of the larger runs, I was unable to obtain machine time before the deadline. Also shown are the code sizes for MPI, as well as the best MPI-based numbers from the HPCC results page for the SGI Altix 3700 with 32 and 128 processors.

I compiled the Cilk programs using Cilk 5.4.2.4 and gcc 3.3.3 with the ``-O2'' optimization flag. In some cases I compiled the kernels of the code using the Intel C compiler (icc), version 9.1, with the ``-O3'' optimization flag. The problem sizes were chosen as specified in the HPC Challenge benchmark. Typically this meant that the problem size was chosen to occupy at least 1/4 of physical memory (that is, 0.5 GiB per processor).
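As a worked example of this sizing rule (using the 256-processor configuration as an assumption, not a figure from the table): 256 processors at 2 GiB each give 512 GiB of physical memory, so a 1/4-of-memory HPL instance gets a budget of 128 GiB = 2^37 bytes. An N-by-N matrix of 8-byte doubles occupies 8N^2 bytes, so N = sqrt(2^37 / 8) = 2^17 = 131,072.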
Cilk Advantages and Disadvantages

The advantages of Cilk, as demonstrated by this entry, are as follows. I wrote most of this code over just a few days. Even the FFTE code, which is relatively complex, came together in only two days. The HPL code took longer: it took me a week to build a C version of the code, and it took Keith Randall a few more days to Cilkify it. This is not bad even compared to developing serial code, and it is certainly more productive than MPI coding. Cilk enabled me to build all-new implementations of FFTE and HPL in only a few weeks of elapsed time.

Cilk's main disadvantage is that it requires a shared-memory machine. This shows up as two costs: (1) today's shared-memory machines are more expensive than today's clusters, and (2) the operating system on a shared-memory machine can introduce serial bottlenecks (as shown by RandomAccess). I believe that technology trends favor Cilk in both of these areas: (1) multicore processors will mean that, in the future, every machine will be a shared-memory machine at the node, and it seems likely that hardware makers will provide ways to build shared-memory clusters as cheaply as distributed-memory clusters; and (2) operating systems will improve as shared-memory machines become more widespread. In the future, the costs of shared memory will drop.

The Benchmarks

Here I remark on some interesting points about each benchmark. The code itself is available at http://bradley.csail.mit.edu/~bradley/hpcc06/
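To illustrate the general pattern by which such kernels can be parallelized in Cilk (a hedged sketch, not the submitted code; the function name, grain size, and the STREAM triad operation a = b + alpha*c are my own choices for illustration), one can divide and conquer over the index range and let Cilk's work-stealing scheduler balance the load:

    /* Sketch only: recursive divide-and-conquer over [lo, hi).
       The base-case grain size of 4096 is an arbitrary choice. */
    cilk void stream_triad(double *a, const double *b, const double *c,
                           double alpha, long lo, long hi)
    {
        if (hi - lo <= 4096) {
            long i;
            for (i = lo; i < hi; i++)   /* serial base case */
                a[i] = b[i] + alpha * c[i];
        } else {
            long mid = lo + (hi - lo) / 2;
            spawn stream_triad(a, b, c, alpha, lo, mid);   /* left half */
            spawn stream_triad(a, b, c, alpha, mid, hi);   /* right half */
            sync;   /* both halves complete before returning */
        }
    }

Because idle processors steal the larger unfinished subtrees, no explicit data distribution or loop scheduling is needed.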
Acknowledgments

Keith Randall contributed substantially to the LU-decomposition implementation. Chris Hill of MIT's Department of Earth and Planetary Sciences helped run the benchmarks on the NASA Columbia machine. I used FFTW [3] and the Intel MKL for the serial parts of certain benchmarks.

Research Support

This research was supported in part by MIT Lincoln Laboratory and NSF Grants ACI-0324974, CNS-0615215, CCF-0541209, CCF-0621511, and CNS-0540248.

References

[1] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An Efficient Multithreaded Runtime System. Journal of Parallel and Distributed Computing, 37(1):55-69, August 1996.

[2] The Cilk Project.

[3] Matteo Frigo and Steven G. Johnson. FFTW: An Adaptive Software Architecture for the FFT. In ICASSP Conference Proceedings, volume 3, pp. 1381-1384, 1998.

[4] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The Implementation of the Cilk-5 Multithreaded Language. In Proceedings of the SIGPLAN '98 Conference on Programming Language Design and Implementation, pp. 212-223, Montreal, Quebec, Canada, 1998.

[5] HPC Challenge Benchmark. 2006.

[6] HPC Challenge Award Competition. 2006.

[7] Bradley C. Kuszmaul. Cilk Provides the ``Best Overall Productivity'' for High Performance Computing (and Won the HPC Challenge Award to Prove It). In Proceedings of the Symposium on Parallelism in Algorithms and Architectures (SPAA), San Diego, CA, 2007. To appear.

[8] Sivan Toledo. Locality of Reference in LU Decomposition with Partial Pivoting. SIAM Journal on Matrix Analysis and Applications, 18(4):1065-1081, October 1997.