MIT CSAIL Research Abstracts

CSAIL Publications and Digital Archive header

Technical Reports

Work Products

Research Abstracts

Historical Collections

horizontal line

Research Abstracts - 2006
horizontal line

horizontal line

Adaptive Scheduling with Parallelism Feedback

Kunal Agrawal, Yu Xiong He, Wen Jing Hsu & Charles E. Leiserson

Introduction

In this research, we address the problem of adaptive scheduling and resource allocation in the domain of dynamic multithreading. Most existing parallel programming systems are nonadaptive, where each job is assigned a fixed number of processors. This policy places the burden of estimating the parallelism of the job on the programmer. In addition, nonadaptive scheduling may lead to a poor use of available resources. For example, if the job's parallelism changes while the job is executing, or if the resources available in the system change, the job is still forced to run with the same number of processors as it was allotted when it started executing. Nonadaptive allocation schemes may also lead to the starvation of some jobs if other jobs are using all the processors in the system. A more attractive model would be an adaptive model, where processors allotted to a job change according to the job's parallelism and the system environment.

Multiprocessor scheduling in a shared multiprogramming environment is often structured as two-level scheduling, where a kernel-level job scheduler allots processors to jobs and a user-level thread scheduler schedules the work of a job on the allotted processors[5]. We have developed adaptive thread schedulers that provide continual parallelism feedback to the job scheduler in the form of requests for processors. Our schedulers guarantee that a job completes near optimally while utilizing at least a constant fraction of the allotted processor cycles.

Scheduling Model

Our scheduling model is as follows. We assume that time is broken into a sequence of equal-size scheduling quanta 1, 2,... of length L, and the job scheduler is free to reallocate processors between quanta. The quantum length L is a system configuration parameter chosen to be long enough to amortize the time to reallocate processors among the various jobs and the time to perform various other bookkeeping for scheduling, including time for the task scheduler to communicate with the job scheduler, which typically involves a system call. Between quanta q-1 and q, the task scheduler determines its job's desire d, which is the number of processors the job wants for quantum q. The task scheduler provides the desire d to the job scheduler as its parallelism feedback. The job scheduler follows some processor allocation policy to determine the \defn{processor availability} p, which is the number of processors the job is entitled to get for the quantum q. The number of processors the job receives for quantum q is the job's allotment a = min{d,}, which is the smaller of its desire and the processor availability. For example, suppose that a job requests d=5 processors and the job scheduler decides that the availability is p=10. Then, the job is allotted a=min{d,p}=min{5,10}=5 processors. If the availability is only p=3, however, the job's allotment is a=min{5,3}=3. Once a job is allotted its processors, the allotment does not change during the quantum. Consequently, the task scheduler must do a good job before a quantum of estimating how many processors it will need for all L time steps of the quantum, as well as do a good job of scheduling the tasks on the allotted processors.

Analysis Technique and Results

Our analysis models the job scheduler as the task scheduler's adversary, challenging the task scheduler to be robust to the system environment and the job scheduler's administrative policies. For example, the job scheduler can make available a huge number of processors exactly when the job has little use for them. Treating the job scheduler as a malicious adversary allows us to abstract away all the other environmental factors (presence/absence of other jobs, etc.) as part of the adversary. Thus, we do not have to worry about these external factors and can now concentrate on a single job, and bound its running time and waste.

For a job with work $T_1<\i>$ and critical path $T_\infty<\i>$ , both $T_1/P_A<\i>$ and $T_\infty<\i>$ are lower bounds on the running time, where $P_A<\i>$ is the mean of the processor availability during the computation. An adversarial job scheduler, however, can prevent any thread scheduler from providing good speedup with respect to the mean availability $P_A<\i>$ in the worst case. For example, if the adversary chooses a huge number of processors for the job's processor availability just when the job has little instantaneous parallelism, no adaptive scheduling algorithm can effectively utilize the available processors on that quantum.

We introduce a technique called trim analysis to analyze the time bound of adaptive thread schedulers under these adversarial conditions. From the field of statistics, trim analysis borrows the idea of ignoring a few ``outliers.'' A trimmed mean, for example, is calculated by discarding a certain number of lowest and highest values and then computing the mean of those that remain. For our purposes, it suffices to trim the availability from just the high side. For a given value R, we define the R-high-trimmed mean availability as the mean availability after ignoring the $R$ steps with the highest availability. A good thread scheduler should provide linear speedup with respect to an R-trimmed availability, where R is as small as possible.

We have developed two task scheduling algorithms A-Greedy (based on a greedy scheduler [3]) and A-Steal (based on work-stealing schedulers [2]), which are provably effective. We prove that these task schedulers guarantee linear speedup with respect to the $O(T_\infty + L lg P<\i>)$ -trimmed availability. Specifically, they completes the job in $O(T_1/P_T + T_\infty + L lg P$ time steps, where $P_T<\i>$ denotes the $O(T_\infty + L lg P<\i>)$ -trimmed availability. Thus, the job achieves linear speed up with respect to $P_T<\i>$ when the job's parallelism dominates the $O(T_\infty + L lg P<\i>)$ -trimmed availability. In addition, we prove that the total number of processor cycles wasted by the job is $O(T<\i>_1)$ , representing at most a constant factor overhead.

Future Work

We have analyzed our task schedulers using an adversarial model for the job scheduler. In practice, however, one would not expect the job scheduler to behave diabolically. Thus, observed bounds on waste and completion time may actually be smaller than the theoretical bounds we have proved. In particular, in this paper we proved that the waste is at most a constant factor of the work. We have begun empirical studies, which although preliminary, seem to indicate that the observed constant is actually quite a bit smaller than the theoretical bound indicates. We are currently engaged in empirical studies and in implementing a practical scheduler, both of which should shed light how these task schedulers performs in the real world.

Dynamic equipartitioning [4,7,8] appears to be an effective way for job schedulers to allocate processors to jobs. If task schedulers provide parallelism feedback using instantaneous parallelism and parallelism rarely changes during a scheduling quantum, a dynamic-equipartitioning job scheduler can optimize global properties like makespan and average completion time [4,6]. For many practical situations, however, a job's parallelism does change quickly and often, making it difficult to obtain perfect information about parallelism. We conjecture that by coupling a task scheduler like ours with a dynamic-equipartitioning job scheduler, provably good global properties can be obtained.

Acknowledgment

Thanks to the members the Supercomputing Technologies Group for their feedback and comments. Siddhartha Sen contributed to an implementation of an adaptive scheduler for Cilk, which inspired this research.

Research Support

This research is supported in part by Singapore-MIT Alliance and by NSF grants ACI-0324974 and CNS-0305606.

References:

[1] Nimar S. Arora, Robert D. Blumofe, C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Symposium of Parallel Algorithms and Architectures, pp. 119--129, Puerto Vallarta, Mexico, June 1998.

[2] Robert D. Blumofe and Charles E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. In Journal of the ACM, pp. 720--748, September, 1999.

[3] Richard. P. Brent. The parallel evaluation of general arithmetic expressions. In Journal of the ACM, 21(2):pp. 201--206, April 1974.

[4] Xiaotie Deng and Patrick Dymond. On Multiprocessor System Scheduling. In Proceedings of the Symposium on Parallel Algorithms and Architectures pp 82--88, Padua, Italy, July 1996.

[5] Dror G. Feitelson. Job Scheduling in Multiprogrammed Parallel Systems. Tech Report. IBM Research Report RC 19690 (87657), August 97.

[6] Ronald. L. Graham. Bounds on multi processor anomalies. In Journal of the Applied Mathematics, 17(2):pp. 416--429, March 1969.

[7] Nian Gu. Competitive Analysis of Dynamic Processor Allocation Strategies. Master's thesis, York University, June 1995. <

[8] Cathy McCann and Raj Vaswani and John Zahorjan. A Dynamic Processor Allocation Policy for Multiprogrammed Shared-Memory Multiprocessors. In ACM Transactions on Computer Systems 11(2):pp 146--178, May 1993.

[9] Siddhartha Sen. Dynamic Processor Allocation for Adaptively Parallel Jobs. Master's thesis, Massachusetts Institute of Technology, September 2004.

[10] Bin Song. Scheduling adaptively parallel jobs. Master's thesis, Massachusetts Institute of Technology, January 1998.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu