CSAIL Research Abstracts - 2005

Distributed Cooperative Computing and Adversity

Bogdan Chlebus, Chryssis Georgiou, Darek Kowalski, Alexander Russell & Alex A. Shvartsman

Introduction

We are investigating the problem of performing cooperative work in dynamic distributed settings where the computing medium (processors and the network) is adversely affected over time by processors joining, leaving, or failing, by changing network connectivity and failures, and by asynchrony in the system. The ability to cooperate on common tasks in a distributed setting is key to solving a broad range of computational problems, ranging from distributed search such as SETI to distributed simulation and multi-agent collaboration in ad-hoc networks. In such settings there is a conflict between efficiency of computation and fault tolerance: to perform the computation efficiently we need to eliminate redundant computation, yet the only way to achieve fault tolerance is through redundancy in computation and communication. We are interested in understanding the trade-offs between efficiency and fault tolerance in a variety of distributed settings. We consider asynchrony as a form of failure where processors are unable to cooperate due to delays; in the extreme case delays are indistinguishable from processor failures.

Approach

Our goal is to develop robust cooperative algorithms and to establish corresponding lower bounds and impossibility results. To study the trade-off between communication efficiency and fault tolerance in distributed cooperative applications, we consider the following abstract problem: p processors must perform t tasks subject to perturbations in the computing/communication medium. We call this cooperation problem Do-All. We have extensively studied this problem in several models, including shared-memory models with processor failures, message-passing models under various failure and asynchrony assumptions, and partitionable network settings. A survey of recent results can be found in [1].
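
As a concrete, deliberately naive illustration of the abstraction (not an algorithm from the papers), the following Python sketch simulates p crash-prone processors performing t idempotent tasks in synchronous rounds under an uncoordinated random scheduling policy, counting work as total processor-steps. The function names and the crash-schedule format are invented here for illustration.

    import random

    def pick_task(done, rng):
        # Naive policy: choose a random unfinished task.  Real Do-All
        # algorithms coordinate so that few processors duplicate a task.
        return rng.choice([i for i, d in enumerate(done) if not d])

    def do_all(p, t, crash_schedule, seed=0):
        # Simulate synchronous rounds; return work = total processor-steps.
        rng = random.Random(seed)
        live = set(range(p))
        done = [False] * t
        work, rnd = 0, 0
        while not all(done):
            live -= crash_schedule.get(rnd, set())  # adversary crashes some
            if not live:
                raise RuntimeError("all processors crashed: unsolvable")
            snapshot = list(done)                   # state seen this round
            picks = [pick_task(snapshot, rng) for _ in live]
            for task in picks:                      # overlaps = wasted work
                done[task] = True
            work += len(live)
            rnd += 1
        return work

    print(do_all(p=8, t=100, crash_schedule={3: {0, 1}, 7: {2}}))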

The Complexity of Synchronous Iterative Do-All with Crashes

We consider the Do-All problem, an abstraction of cooperative activity, in the setting where t tasks need to be performed in a distributed system of p failure-prone processors. Many distributed and parallel algorithms have been developed for this problem, and several algorithm simulations have been developed by iterating Do-All algorithms. The efficiency of solutions for Do-All is measured in terms of work complexity, where all processing steps taken by all processors are counted. Work is ideally expressed as a function of t, p, and f, the number of processor crashes. However, the previous lower bounds and the upper bounds for extant algorithms did not adequately show how work depends on f. We present the first non-trivial lower bounds for Do-All that capture the dependence of work on t, p, and f [2]. For the model of computation where processors are able to make perfect load-balancing decisions locally, we also present matching upper bounds. We define the r-iterative Do-All problem, which abstracts the repeated use of Do-All as found in typical algorithm simulations. Our f-sensitive analysis enables us to derive tight bounds for r-iterative Do-All work (that are stronger than the r-fold work complexity of a single Do-All). Our approach of modeling perfect load-balancing allows the analysis of specific algorithms to be divided into two parts: (i) the analysis of the cost of tolerating failures while performing work under "free" load-balancing, and (ii) the analysis of the cost of implementing load-balancing. We demonstrate the utility and generality of this approach by improving the analysis of two known efficient algorithms. We give an improved analysis of an efficient message-passing algorithm. We also derive a tight and complete analysis of the best known Do-All algorithm for the synchronous shared-memory model. Finally, we present a new upper bound on simulations of synchronous shared-memory algorithms on crash-prone processors.
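
To make the "free" perfect load-balancing model concrete, here is a hedged sketch of ours (not the algorithms analyzed in [2]): in every round the live processors partition the remaining tasks without duplication, an adversary crashes a prescribed number of processors at the start of each round, and only processors that crash mid-round waste their steps. The crash vector plays the role of f.

    def perfectly_balanced_work(p, t, crashes):
        # Idealized Do-All where live processors partition the remaining
        # tasks with no duplication ("free" load balancing).  crashes[r]
        # is the number of processors failing at the start of round r; a
        # crashing processor is charged its step but completes no task.
        live, remaining, work, rnd = p, t, 0, 0
        while remaining > 0:
            f_r = crashes[rnd] if rnd < len(crashes) else 0
            f_r = min(f_r, live - 1)   # at least one processor survives
            work += live               # every live processor takes a step
            live -= f_r                # ...but f_r of them crash mid-round
            remaining -= min(live, remaining)
            rnd += 1
        return work

    print(perfectly_balanced_work(p=10, t=100, crashes=[]))      # 100
    print(perfectly_balanced_work(p=10, t=100, crashes=[5, 2]))  # 108

Front-loading crashes forces extra rounds and wasted steps even under free load balancing, which conveys the f-sensitivity that the bounds in [2] quantify.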

Writing-All Deterministically and Optimally Using a Non-Trivial Number of Asynchronous Processors

Here we consider the shared-memory version of the basic cooperation problem, where the tasks are abstracted as updates of shared-memory locations: using p processors, write 1's into all locations of an array of size n. This variation of the Do-All problem is called Write-All. Despite substantial research, there is a dearth of efficient deterministic asynchronous algorithms for Write-All, where processors must efficiently update shared memory despite delays (or failures, in the extreme case of an unbounded delay). Efficiency of algorithms is measured in terms of work. Thus an optimal algorithm would have work proportional to n; however, it is known that optimality cannot be achieved when p is close to n. The quest then is to obtain work-optimal solutions for this problem using a number of processors p that is non-trivial compared to n. Our new result [3] significantly extends the range of processors for which optimality is achieved: it shows that optimality can be achieved using close to sqrt(n) processors. Additionally, the new result uses only atomic read/write memory, without resorting to the test-and-set primitive that was necessary in a previous solution. We presented the algorithm and gave its analysis, showing that the work complexity of the algorithm is approximately O(n + p^2), which is optimal when p = O(sqrt(n)), while all prior deterministic algorithms require super-linear work when p = Omega(n^0.25).
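
For contrast with the result in [3], here is a hedged simulation of the trivial asynchronous Write-All algorithm, in which every processor scans the entire array; its work grows like n * p, which is exactly what a work-optimal O(n + p^2) algorithm avoids. The scheduler below uses randomness as a crude stand-in for a true adversary, and all names are illustrative.

    import random

    def trivial_write_all(n, p, seed=0):
        # Every processor independently scans the whole array, writing a 1
        # wherever it reads a 0.  Work counts every read and write step.
        shared = [0] * n
        work = 0

        def processor():
            nonlocal work
            for i in range(n):
                work += 1               # read shared[i]
                if shared[i] == 0:
                    work += 1           # write shared[i]
                    shared[i] = 1
                yield                   # the scheduler may preempt us here

        rng = random.Random(seed)       # random stand-in for the adversary
        procs = [processor() for _ in range(p)]
        while procs:
            q = rng.choice(procs)       # pick which processor steps next
            try:
                next(q)
            except StopIteration:
                procs.remove(q)
        assert all(shared)
        return work

    print(trivial_write_all(n=64, p=8))  # on the order of n * p steps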

Collective Asynchronous Reading with Polylogarithmic Worst-Case Overhead

The Collect problem for an asynchronous shared-memory system requires the processors to cooperatively learn all values of a collection of shared registers while minimizing the total number of read and write operations. The model consists of n asynchronous processes, each with a single-writer multi-reader register of polynomial capacity. The best previously known deterministic solution performs O(n^{3/2} log n) reads and writes. We presented a new deterministic algorithm that performs O(n log^7 n) read/write operations [4], thus substantially improving the best previous upper bound. The algorithm takes an approach based on epidemic rumor-spreading; its novelty lies in using a family of expander graphs and ensuring that each of the successive groups of processes collects and propagates sufficiently many rumors to the next group. The algorithm is adapted to the Repeatable Collect problem, an on-line version of Collect. The competitive latency of the new algorithm is O(log^7 n), versus the much higher competitive latency of O(sqrt(n) log n) for a prior solution. A result of independent interest in this paper abstracts a gossiping game that is played on a graph and whose payoff is given in terms of expansion.
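
The following Python sketch conveys the flavor of register-based gossip, though it is not the algorithm of [4]: each process repeatedly reads one neighbour's single-writer register and merges the contents into its own. Where [4] uses an explicit expander family to control how rumors flow between successive groups, this sketch substitutes a cycle with random chords, which merely guarantees connectivity.

    import random

    def gossip_collect(n, d=4, seed=0):
        # Each process owns a single-writer multi-reader register holding
        # the set of rumors it knows.  A step reads one neighbour's
        # register and rewrites one's own with the merged view.
        rng = random.Random(seed)
        # Cycle plus random chords: a connectivity-guaranteeing stand-in
        # for the explicit expander family of [4].
        neighbours = [[(i - 1) % n, (i + 1) % n] +
                      rng.sample([j for j in range(n) if j != i], d - 2)
                      for i in range(n)]
        register = [{i} for i in range(n)]  # each knows its own value
        ops, full = 0, set(range(n))
        while any(view != full for view in register):
            i = rng.randrange(n)            # scheduler picks who moves
            j = rng.choice(neighbours[i])
            register[i] = register[i] | register[j]  # one read, one write
            ops += 2
        return ops

    print(gossip_collect(n=64))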

Ongoing Work

In the past decade numerous algorithms have been developed for such distributed cooperation problems; however, there is a dearth of algorithms for asynchronous message-passing models for which performance guarantees can be analytically established. We are investigating new algorithmic approaches for such settings and are researching precise upper and lower bounds on work (computation) and communication.

Research Support

This research has been supported by NSF ITR Grant CCR-0121277, AFOSR contract F49620-00-1-0097, NSF-NATO Award 0209588, NSF Career Award 9984778, and NSF Grants 9988304 and 0311368.

References

[1] Chryssis Georgiou, Alexander Russell, and Alexander A. Shvartsman. Distributed Cooperation and Adversity: Complexity Trade-Offs. Proceedings of the ACM Workshop on Principles of Computation and Knowledge (PCK50), pages 60-71, 2003.

[2] Chryssis Georgiou, Alexander Russell, and Alexander A. Shvartsman. The Complexity of Synchronous Iterative Do-All with Crashes. Distributed Computing, 17(1), pages 47-63, 2004.

[3] Dariusz R. Kowalski and Alexander A. Shvartsman. Writing-All Deterministically and Optimally Using a Non-Trivial Number of Asynchronous Processors. Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2004), pages 311-320, 2004.

[4] Bogdan S. Chlebus, Dariusz R. Kowalski, and Alexander A. Shvartsman. Collective Asynchronous Reading with Polylogarithmic Worst-Case Overhead. Proceedings of the ACM Symposium on Theory of Computing (STOC 2004), pages 321-330, 2004.
