CSAIL Research Abstract

An Architecture for Network Fault Diagnosis in the Knowledge Plane

George J. Lee, Peyman Faratin & David D. Clark

Introduction

Communication networks today do a good job of allowing a wide variety of devices to communicate with one another, but when something fails it can be difficult to determine the cause. There has been a great deal of research into specific diagnostic methods for detecting network problems such as denial of service (DoS) attacks, Internet worms, and BGP misconfiguration, but no common infrastructure exists on the Internet today to automate such diagnosis. The goal of our research is therefore to develop an architecture that makes use of diverse diagnostic and data collection systems to automate the diagnosis of a wide range of network failures.

Our approach to this problem is based on the idea of the Knowledge Plane [1], a global network overlay that collects data about the network, reasons about its state and activity, and distributes this information to different points in the network. In addition to automated network fault diagnosis, the Knowledge Plane will also enable automatic management and configuration of networks. Within this framework, we aim to design a scalable architecture for network fault diagnosis that supports many different data collection and diagnostic methods. We hope that this architecture will also encourage the development of new data collection and diagnostic systems and create a competitive market for data and diagnoses, enabling choice and promoting accuracy and efficiency. The resulting system should allow end users to request diagnoses and receive prompt and accurate information about their problems.

The principal requirements of this diagnostic architecture are scalability and support for a wide range of diagnostic and data collection systems. In a network as large as the Internet, it is essential that the process of diagnosis be scalable and not overload the network. A network fault can potentially cause a huge number of hosts to experience problems and request diagnosis; our system must be able to effectively deal with the volume of requests generated in such situations.

This architecture must also support diverse diagnostic and data collection systems. Many kinds of data can be collected and used for diagnosis, and different situations may require different diagnostic methods. Our system must support these various methods.

Approach

The diagnostic architecture we propose must provide four things: the collection of many types of network data, different methods for making diagnoses, a way to match diagnostic systems with the data collection systems that provide the data they need, and a way for users to request and receive diagnoses. To support these functions in a scalable way, we classify the entities in this diagnostic architecture into the following systems:

Diagnostic Systems that handle diagnostic requests, reason about network data, and return diagnostic responses.
Data Collection Systems that collect a variety of data from different parts of the network and make this data available to diagnostic systems.
Index Servers to match users with appropriate diagnostic systems and to match diagnostic systems with the data collection systems that provide the data they require.
User Agents to collect user input, request diagnoses on behalf of the user, and present diagnoses to the user in an understandable way.
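
The matching role of an index server can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not part of the proposed architecture: the `Advertisement` schema, the `IndexServer` methods, and the system and data type names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advertisement:
    """What a system tells an index server about itself (hypothetical schema)."""
    name: str
    kind: str                          # "diagnostic" or "data_collection"
    provides: frozenset                # fault types diagnosed, or data types collected
    requires: frozenset = frozenset()  # data types a diagnostic system needs


class IndexServer:
    """Matches users to diagnostic systems, and those systems to data sources."""

    def __init__(self):
        self._ads = []

    def register(self, ad):
        self._ads.append(ad)

    def find_diagnostic_systems(self, fault_type):
        return [a for a in self._ads
                if a.kind == "diagnostic" and fault_type in a.provides]

    def find_data_sources(self, diagnostic_ad):
        # Map each required data type to the collectors that offer it.
        return {
            data_type: [a.name for a in self._ads
                        if a.kind == "data_collection" and data_type in a.provides]
            for data_type in sorted(diagnostic_ad.requires)
        }


index = IndexServer()
index.register(Advertisement("bgp-monitor", "data_collection",
                             frozenset({"bgp_updates"})))
index.register(Advertisement("ping-mesh", "data_collection",
                             frozenset({"rtt_samples"})))
index.register(Advertisement("route-doctor", "diagnostic",
                             frozenset({"unreachable_destination"}),
                             frozenset({"bgp_updates", "rtt_samples"})))

# A user agent asks which systems can diagnose its symptom.
candidates = index.find_diagnostic_systems("unreachable_destination")
sources = index.find_data_sources(candidates[0])
print(candidates[0].name, sources)
```

In this sketch the user agent and the diagnostic system both talk to the same index, which keeps the two matching steps (user to diagnostic system, diagnostic system to data source) behind one interface.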

By considering data collection, diagnosis, matching, and user interfaces separately, this architecture enables choice and competition. Different users can choose different types of user agents. A user agent may decide to use different diagnostic systems depending on the nature of a fault. Different diagnostic systems may use different types of data for diagnosis. Data collection systems may compete with one another to provide data to diagnostic systems, encouraging them to provide more accurate data at lower cost. Diagnostic systems may also compete with one another for users. This separation also promotes scalability: creating standard interfaces for diagnosis and data collection reduces redundancy by allowing one system to be used by many others, and allows each system to aggregate and cache diagnostic and data collection requests and responses.
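
One way to picture how caching helps when a single fault triggers many identical requests is the sketch below. It assumes a simple time-to-live cache keyed on the symptom and destination. The class name, the request key, and the TTL value are illustrative assumptions, and `slow_diagnosis` is a stand-in for real reasoning over network data.

```python
import time


class CachingDiagnosticSystem:
    """Answers repeated requests from a cache (hypothetical design)."""

    def __init__(self, diagnose_fn, ttl_seconds=60.0, clock=time.monotonic):
        self._diagnose_fn = diagnose_fn
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}          # request key -> (expiry time, diagnosis)
        self.upstream_calls = 0   # how often real diagnostic work was done

    def diagnose(self, symptom, destination):
        key = (symptom, destination)
        now = self._clock()
        cached = self._cache.get(key)
        if cached is not None and cached[0] > now:
            return cached[1]      # still fresh: reuse the earlier diagnosis
        self.upstream_calls += 1
        diagnosis = self._diagnose_fn(symptom, destination)
        self._cache[key] = (now + self._ttl, diagnosis)
        return diagnosis


def slow_diagnosis(symptom, destination):
    # Stand-in for real reasoning over collected network data.
    return f"{symptom} toward {destination}: suspected link failure"


system = CachingDiagnosticSystem(slow_diagnosis, ttl_seconds=60.0)
# 1000 users report the same problem; only the first triggers real work.
answers = {system.diagnose("timeout", "AS64500") for _ in range(1000)}
print(len(answers), system.upstream_calls)
```

The TTL trades freshness for load: a longer TTL absorbs more duplicate requests but may return a stale diagnosis after the fault has cleared.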

This architecture also promotes scalability by organizing systems by region. Diagnostic systems and data collection systems are responsible for different regions (e.g. Internet autonomous systems, or ASes) in the network, and communication in the diagnosis architecture occurs according to the topology of region connectivity. This allows information to be aggregated and cached to effectively deal with large networks and many users.
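
A rough sketch of region-based communication follows. It assumes that regions form an undirected adjacency graph and that a request is forwarded along a shortest chain of neighboring regions, which is a simplification (real interdomain paths follow routing policy, not hop count). The function names, the status table, and the AS numbers (taken from the range reserved for documentation) are hypothetical.

```python
from collections import deque


def region_path(adjacency, source, target):
    """Breadth-first search for a shortest chain of neighboring regions."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        region = queue.popleft()
        if region == target:
            path = []
            while region is not None:
                path.append(region)
                region = previous[region]
            return path[::-1]
        for neighbor in adjacency.get(region, ()):
            if neighbor not in previous:
                previous[neighbor] = region
                queue.append(neighbor)
    return None  # the target region is not reachable


def diagnose_along_path(adjacency, region_status, source, target):
    """Ask the system responsible for each region on the path for its status."""
    path = region_path(adjacency, source, target)
    if path is None:
        return None
    return [(region, region_status.get(region, "unknown")) for region in path]


adjacency = {
    "AS64500": ["AS64501"],
    "AS64501": ["AS64500", "AS64502"],
    "AS64502": ["AS64501"],
}
# Stand-in for per-region diagnostic systems reporting on their own region.
region_status = {"AS64500": "ok", "AS64501": "link down", "AS64502": "ok"}

report = diagnose_along_path(adjacency, region_status, "AS64500", "AS64502")
print(report)
```

Because each region answers only for itself, a per-region result can be cached and reused by every request whose path crosses that region.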

Progress and Future Work

We are currently working on creating a flexible ontology for describing data and diagnosis in this architecture and specifying the protocols for communication between different systems. This ontology should allow the description of different types of data and diagnoses, aggregation of data, and queries to find appropriate diagnosis and data collection systems.
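
To make the role of such an ontology concrete, here is a small sketch under strong assumptions: data types are arranged in a single-parent hierarchy, and a query for a general type matches any system offering a more specific type. The taxonomy, the type names, and the system names are invented for illustration and do not reflect the ontology under development.

```python
# Hypothetical taxonomy: each data type points to its more general parent.
PARENT = {
    "network_data": None,
    "routing_data": "network_data",
    "bgp_updates": "routing_data",
    "performance_data": "network_data",
    "rtt_samples": "performance_data",
}

# Hypothetical data collection systems and the type each one offers.
OFFERS = {
    "bgp-monitor": "bgp_updates",
    "ping-mesh": "rtt_samples",
}


def is_a(term, ancestor):
    """True if `term` equals `ancestor` or is one of its descendants."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False


def systems_offering(requested_type):
    """Find systems whose offered data type falls under the requested type."""
    return sorted(name for name, offered in OFFERS.items()
                  if is_a(offered, requested_type))


print(systems_offering("routing_data"))
print(systems_offering("network_data"))
```

A hierarchy like this lets a diagnostic system ask for data at whatever level of generality it needs without knowing every concrete data source in advance.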

We are also investigating several challenges: developing scalable aggregation, caching, and forwarding mechanisms; organizing this diagnostic architecture by region in a scalable way; supporting an intuitive user interface; and integrating with existing data collection and diagnostic systems. Our goal is to implement this architecture on a testbed such as PlanetLab and compare different diagnostic and data collection systems. This would allow other researchers to implement their systems on top of this architecture and provide a way to evaluate diagnostic and data collection methods.

References

[1] David D. Clark, Craig Partridge, J. Christopher Ramming, and John T. Wroclawski. A knowledge plane for the internet. In Proceedings of the SIGCOMM 2003 conference, pages 3-10. ACM Press, 2003.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu
(Note: On July 1, 2003, the AI Lab and LCS merged to form CSAIL.)