MIT CSAIL Research Abstracts

CSAIL Publications and Digital Archive header

Research Abstracts Home

CSAIL Digital Archive

Research Activities

CSAIL Home

horizontal line

Research Abstracts - 2007
horizontal line

horizontal line

Adaptive Scheduling with Parallelism Feedback

Kunal Agrawal, Yu Xiong He (NUS), Wen Jing Hsu (NTU) & Charles E. Leiserson

Introduction

In this research, we address the problem of adaptive scheduling and resource allocation in the domain of dynamic multithreading. Most existing parallel programming systems are nonadaptive, where each job is assigned a fixed number of processors. This policy places the burden of estimating the parallelism of the job on the programmer. In addition, nonadaptive scheduling may lead to a poor use of available resources. For example, if the job's parallelism changes while the job is executing, or if the resources available in the system change, the job is still forced to run with the same number of processors as it was allotted when it started executing. Nonadaptive allocation schemes may also lead to the starvation of some jobs if other jobs are using all the processors in the system. A more attractive model would be an adaptive model, where processors allotted to a job change according to the job's parallelism and the system environment.

Multiprocessor scheduling in a shared multiprogramming environment is often structured as two-level scheduling, where a kernel-level job scheduler allots processors to jobs and a user-level thread scheduler schedules the work of a job on the allotted processors[5]. We have developed adaptive thread schedulers that provide continual parallelism feedback to the job scheduler in the form of requests for processors. Our schedulers guarantee that a job completes near optimally while utilizing at least a constant fraction of the allotted processor cycles.

Scheduling Model

Our scheduling model is as follows. We assume that time is broken into a sequence of equal-size scheduling quanta 1, 2,... of length L, and the job scheduler is free to reallocate processors between quanta. The quantum length L is a system configuration parameter chosen to be long enough to amortize the time to reallocate processors among the various jobs and the time to perform various other bookkeeping for scheduling, including time for the task scheduler to communicate with the job scheduler, which typically involves a system call. Between quanta q-1 and q, the task scheduler determines its job's desire d, which is the number of processors the job wants for quantum q. The task scheduler provides the desire d to the job scheduler as its parallelism feedback. The job scheduler follows some processor allocation policy to determine the processor availability p, which is the number of processors the job is entitled to get for the quantum q. The number of processors the job receives for quantum q is the job's allotment a = min{d, p}, which is the smaller of its desire and the processor availability. For example, suppose that a job requests d=5 processors and the job scheduler decides that the availability is p=10. Then, the job is allotted a=min{d,p}=min{5,10}=5 processors. If the availability is only p=3, however, the job's allotment is a=min{5,3}=3. Once a job is allotted its processors, the allotment does not change during the quantum. Consequently, the task scheduler must do a good job before a quantum of estimating how many processors it will need for all L time steps of the quantum, as well as do a good job of scheduling the tasks on the allotted processors.

Theoretical Analysis Technique and Results

Our analysis models the job scheduler as the task scheduler's adversary, challenging the task scheduler to be robust to the system environment and the job scheduler's administrative policies. For example, the job scheduler can make available a huge number of processors exactly when the job has little use for them. Treating the job scheduler as a malicious adversary allows us to abstract away all the other environmental factors (presence/absence of other jobs, etc.) as part of the adversary. Thus, we do not have to worry about these external factors and can now concentrate on a single job, and bound its running time and waste.

For a job with work $T_1<\i>$ and critical path $T_\infty<\i>$ , both $T_1/P_A<\i>$ and $T_\infty<\i>$ are lower bounds on the running time, where $P_A<\i>$ is the mean of the processor availability during the computation. An adversarial job scheduler, however, can prevent any thread scheduler from providing good speedup with respect to the mean availability $P_A<\i>$ in the worst case. For example, if the adversary chooses a huge number of processors for the job's processor availability just when the job has little instantaneous parallelism, no adaptive scheduling algorithm can effectively utilize the available processors on that quantum.

We introduce a technique called trim analysis to analyze the time bound of adaptive thread schedulers under these adversarial conditions. From the field of statistics, trim analysis borrows the idea of ignoring a few ``outliers.'' A trimmed mean, for example, is calculated by discarding a certain number of lowest and highest values and then computing the mean of those that remain. For our purposes, it suffices to trim the availability from just the high side. For a given value R, we define the R-high-trimmed mean availability as the mean availability after ignoring the $R$ steps with the highest availability. A good thread scheduler should provide linear speedup with respect to an R-trimmed availability, where R is as small as possible.

We have developed two task scheduling algorithms A-Greedy (based on a greedy scheduler [3,7]) and A-Steal (based on work-stealing schedulers [2]), which are provably effective. We prove that these task schedulers guarantee linear speedup with respect to the $O(T_\infty + L lg P<\i>)$ -trimmed availability. Specifically, they completes the job in $O(T_1/P_T + T_\infty + L lg P$ time steps, where $P_T<\i>$ denotes the $O(T_\infty + L lg P<\i>)$ -trimmed availability. Thus, the job achieves linear speed up with respect to $P_T<\i>$ when the job's parallelism dominates the $O(T_\infty + L lg P<\i>)$ -trimmed availability. In addition, we prove that the total number of processor cycles wasted by the job is $O(T<\i>_1)$ , representing at most a constant factor overhead.

Dynamic equipartitioning [4,8,9] is an effective way for job schedulers to allocate processors to jobs. If the A-Greedy (or A-Steal) is used as task schedulers for every job and the job scheduler is a dynamic equipartitioning job scheduler, then we prove that both makespan and mean completion time of the jobs are constant-competitive with respect to an optimal scheduler. The constants depend on parameter settings.

Experimental Results

We have analyzed our task schedulers using an adversarial model for the job scheduler. In practice, however, one would not expect the job scheduler to behave diabolically. Thus, observed bounds on waste and completion time may actually be smaller than the theoretical bounds we have proved. We evaluated the performance of A-Steal in practice using simulation studies.

Our studies monitored the behavior of A-Steal on a simulated multiprocessor system using synthetic workloads. We measured the completion time and waste of A-Steal on over 2300 job runs using a variety of processor availability profiles. Linear-regression analysis indicates that A-Steal provides almost perfect linear speedup. In addition, A-Steal typically wasted less than 20% of the processor cycles allotted to the job. We compared A-Steal with the ABP algorithm, an adaptive work-stealing thread scheduler developed by Arora, Blumofe, and Plaxton which does not employ parallelism feedback. On moderately to heavily loaded large machines with predetermined availability profiles, A-Steal typically completed jobs more than twice as quickly, despite being allotted the same or fewer processors on every step, while wasting only 10% of the processor cycles wasted by ABP. We compared the utilization of A-Steal and ABP when many jobs with varying characteristics are using the same multiprocessor. These experiments provide evidence that A-Steal consistently provides higher utilization than ABP for a variety of job mixes.

Future Work

We have evaluated the performance of A-Steal in a simulated environment. The next step is to implement A-Steal in a real programming environment. The Cilk[6] runtime system is a reasonable place to start and develop a prototype task scheduler that provides parallelism feedback.

Acknowledgment

Thanks to the members the Supercomputing Technologies Group for their feedback and comments. Siddhartha Sen contributed to an implementation of an adaptive scheduler for Cilk, which inspired this research.

Research Support

This research is supported in part by Singapore-MIT Alliance and by NSF grant ACI-0615215.

References:

[1] Nimar S. Arora, Robert D. Blumofe, C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Symposium of Parallel Algorithms and Architectures, pp. 119--129, Puerto Vallarta, Mexico, June 1998.

[2] Robert D. Blumofe and Charles E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. In Journal of the ACM, pp. 720--748, September, 1999.

[3] Richard. P. Brent. The parallel evaluation of general arithmetic expressions. In Journal of the ACM, 21(2):pp. 201--206, April 1974.

[4] Xiaotie Deng and Patrick Dymond. On Multiprocessor System Scheduling. In Proceedings of the Symposium on Parallel Algorithms and Architectures pp 82--88, Padua, Italy, July 1996.

[5] Dror G. Feitelson. Job Scheduling in Multiprogrammed Parallel Systems. Tech Report. IBM Research Report RC 19690 (87657), August 97.

[6] Matteo Frigo,Charles E. Leiserson, and Keith H. Randall. The Implementation of the Cilk-5 Multithreaded Language. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. pp 212-223, Montreal, Quebec, Canada, June, 1998

[7] Ronald. L. Graham. Bounds on multi processor anomalies. In Journal of the Applied Mathematics, 17(2):pp. 416--429, March 1969.

[8] Nian Gu. Competitive Analysis of Dynamic Processor Allocation Strategies. Master's thesis, York University, June 1995.

[9] Cathy McCann and Raj Vaswani and John Zahorjan. A Dynamic Processor Allocation Policy for Multiprogrammed Shared-Memory Multiprocessors. In ACM Transactions on Computer Systems 11(2):pp 146--178, May 1993.

[10] Siddhartha Sen. Dynamic Processor Allocation for Adaptively Parallel Jobs. Master's thesis, Massachusetts Institute of Technology, September 2004.

[11] Bin Song. Scheduling adaptively parallel jobs. Master's thesis, Massachusetts Institute of Technology, January 1998.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu