
Research
Abstracts  2007 
Adaptive Scheduling with Parallelism FeedbackKunal Agrawal, Yu Xiong He (NUS), Wen Jing Hsu (NTU) & Charles E. LeisersonIntroductionIn this research, we address the problem of adaptive scheduling and resource allocation in the domain of dynamic multithreading. Most existing parallel programming systems are nonadaptive, where each job is assigned a fixed number of processors. This policy places the burden of estimating the parallelism of the job on the programmer. In addition, nonadaptive scheduling may lead to a poor use of available resources. For example, if the job's parallelism changes while the job is executing, or if the resources available in the system change, the job is still forced to run with the same number of processors as it was allotted when it started executing. Nonadaptive allocation schemes may also lead to the starvation of some jobs if other jobs are using all the processors in the system. A more attractive model would be an adaptive model, where processors allotted to a job change according to the job's parallelism and the system environment. Multiprocessor scheduling in a shared multiprogramming environment is often structured as twolevel scheduling, where a kernellevel job scheduler allots processors to jobs and a userlevel thread scheduler schedules the work of a job on the allotted processors[5]. We have developed adaptive thread schedulers that provide continual parallelism feedback to the job scheduler in the form of requests for processors. Our schedulers guarantee that a job completes near optimally while utilizing at least a constant fraction of the allotted processor cycles. Scheduling ModelOur scheduling model is as follows. We assume that time is broken into a sequence of equalsize scheduling quanta 1, 2,... of length L, and the job scheduler is free to reallocate processors between quanta. The quantum length L is a system configuration parameter chosen to be long enough to amortize the time to reallocate processors among the various jobs and the time to perform various other bookkeeping for scheduling, including time for the task scheduler to communicate with the job scheduler, which typically involves a system call. Between quanta q1 and q, the task scheduler determines its job's desire d, which is the number of processors the job wants for quantum q. The task scheduler provides the desire d to the job scheduler as its parallelism feedback. The job scheduler follows some processor allocation policy to determine the processor availability p, which is the number of processors the job is entitled to get for the quantum q. The number of processors the job receives for quantum q is the job's allotment a = min{d, p}, which is the smaller of its desire and the processor availability. For example, suppose that a job requests d=5 processors and the job scheduler decides that the availability is p=10. Then, the job is allotted a=min{d,p}=min{5,10}=5 processors. If the availability is only p=3, however, the job's allotment is a=min{5,3}=3. Once a job is allotted its processors, the allotment does not change during the quantum. Consequently, the task scheduler must do a good job before a quantum of estimating how many processors it will need for all L time steps of the quantum, as well as do a good job of scheduling the tasks on the allotted processors. Theoretical Analysis Technique and ResultsOur analysis models the job scheduler as the task scheduler's adversary, challenging the task scheduler to be robust to the system environment and the job scheduler's administrative policies. For example, the job scheduler can make available a huge number of processors exactly when the job has little use for them. Treating the job scheduler as a malicious adversary allows us to abstract away all the other environmental factors (presence/absence of other jobs, etc.) as part of the adversary. Thus, we do not have to worry about these external factors and can now concentrate on a single job, and bound its running time and waste. For a job with work and critical path , both and are lower bounds on the running time, where is the mean of the processor availability during the computation. An adversarial job scheduler, however, can prevent any thread scheduler from providing good speedup with respect to the mean availability in the worst case. For example, if the adversary chooses a huge number of processors for the job's processor availability just when the job has little instantaneous parallelism, no adaptive scheduling algorithm can effectively utilize the available processors on that quantum. We introduce a technique called trim analysis to analyze the time bound of adaptive thread schedulers under these adversarial conditions. From the field of statistics, trim analysis borrows the idea of ignoring a few ``outliers.'' A trimmed mean, for example, is calculated by discarding a certain number of lowest and highest values and then computing the mean of those that remain. For our purposes, it suffices to trim the availability from just the high side. For a given value R, we define the Rhightrimmed mean availability as the mean availability after ignoring the $R$ steps with the highest availability. A good thread scheduler should provide linear speedup with respect to an Rtrimmed availability, where R is as small as possible. We have developed two task scheduling algorithms AGreedy (based on a greedy scheduler [3,7]) and ASteal (based on workstealing schedulers [2]), which are provably effective. We prove that these task schedulers guarantee linear speedup with respect to the trimmed availability. Specifically, they completes the job in time steps, where denotes the trimmed availability. Thus, the job achieves linear speed up with respect to when the job's parallelism dominates the trimmed availability. In addition, we prove that the total number of processor cycles wasted by the job is , representing at most a constant factor overhead. Dynamic equipartitioning [4,8,9] is an effective way for job schedulers to allocate processors to jobs. If the AGreedy (or ASteal) is used as task schedulers for every job and the job scheduler is a dynamic equipartitioning job scheduler, then we prove that both makespan and mean completion time of the jobs are constantcompetitive with respect to an optimal scheduler. The constants depend on parameter settings. Experimental ResultsWe have analyzed our task schedulers using an adversarial model for the job scheduler. In practice, however, one would not expect the job scheduler to behave diabolically. Thus, observed bounds on waste and completion time may actually be smaller than the theoretical bounds we have proved. We evaluated the performance of ASteal in practice using simulation studies. Our studies monitored the behavior of ASteal on a simulated multiprocessor system using synthetic workloads. We measured the completion time and waste of ASteal on over 2300 job runs using a variety of processor availability profiles. Linearregression analysis indicates that ASteal provides almost perfect linear speedup. In addition, ASteal typically wasted less than 20% of the processor cycles allotted to the job. We compared ASteal with the ABP algorithm, an adaptive workstealing thread scheduler developed by Arora, Blumofe, and Plaxton which does not employ parallelism feedback. On moderately to heavily loaded large machines with predetermined availability profiles, ASteal typically completed jobs more than twice as quickly, despite being allotted the same or fewer processors on every step, while wasting only 10% of the processor cycles wasted by ABP. We compared the utilization of ASteal and ABP when many jobs with varying characteristics are using the same multiprocessor. These experiments provide evidence that ASteal consistently provides higher utilization than ABP for a variety of job mixes. Future WorkWe have evaluated the performance of ASteal in a simulated environment. The next step is to implement ASteal in a real programming environment. The Cilk[6] runtime system is a reasonable place to start and develop a prototype task scheduler that provides parallelism feedback. AcknowledgmentThanks to the members the Supercomputing Technologies Group for their feedback and comments. Siddhartha Sen contributed to an implementation of an adaptive scheduler for Cilk, which inspired this research. Research SupportThis research is supported in part by SingaporeMIT Alliance and by NSF grant ACI0615215. References:[1] Nimar S. Arora, Robert D. Blumofe, C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Symposium of Parallel Algorithms and Architectures, pp. 119129, Puerto Vallarta, Mexico, June 1998. [2] Robert D. Blumofe and Charles E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. In Journal of the ACM, pp. 720748, September, 1999. [3] Richard. P. Brent. The parallel evaluation of general arithmetic expressions. In Journal of the ACM, 21(2):pp. 201206, April 1974. [4] Xiaotie Deng and Patrick Dymond. On Multiprocessor System Scheduling. In Proceedings of the Symposium on Parallel Algorithms and Architectures pp 8288, Padua, Italy, July 1996. [5] Dror G. Feitelson. Job Scheduling in Multiprogrammed Parallel Systems. Tech Report. IBM Research Report RC 19690 (87657), August 97. [6] Matteo Frigo,Charles E. Leiserson, and Keith H. Randall. The Implementation of the Cilk5 Multithreaded Language. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. pp 212223, Montreal, Quebec, Canada, June, 1998 [7] Ronald. L. Graham. Bounds on multi processor anomalies. In Journal of the Applied Mathematics, 17(2):pp. 416429, March 1969. [8] Nian Gu. Competitive Analysis of Dynamic Processor Allocation Strategies. Master's thesis, York University, June 1995. [9] Cathy McCann and Raj Vaswani and John Zahorjan. A Dynamic Processor Allocation Policy for Multiprogrammed SharedMemory Multiprocessors. In ACM Transactions on Computer Systems 11(2):pp 146178, May 1993. [10] Siddhartha Sen. Dynamic Processor Allocation for Adaptively Parallel Jobs. Master's thesis, Massachusetts Institute of Technology, September 2004. [11] Bin Song. Scheduling adaptively parallel jobs. Master's thesis, Massachusetts Institute of Technology, January 1998. 

