MIT CSAIL Research Abstracts

Research Abstracts - 2007

CAPRI: A Common Architecture for Distributed Probabilistic Internet Fault Diagnosis

George J. Lee

Introduction

Often when network failures occur in the Internet, users and network administrators have difficulty determining the precise location and cause of failure. For example, a failure to connect to a web server may result from faults in any one of a number of locations; perhaps the local network is misconfigured, a failure may have occurred along the route to the server, or perhaps the server's network card has failed. To a large extent the difficulty of diagnosis results not from a lack of tools or techniques for diagnosis, but from the challenge of efficiently coordinating the exchange of diagnostic observations, beliefs, and knowledge across multiple administrative domains in an unreliable and dynamic network.

Consider the diagnosis of an HTTP connection failure from a user to a web server. A diagnostic agent on the user's local machine might be able to perform certain local tests and make observations about the user's network connection, but to diagnose the failure accurately, we also need information from specialist agents in other parts of the network that can test and observe other network components. In addition, diagnosis requires dependency knowledge about the relationships among various classes of network components in order to combine information from multiple agents and make inferences from the results of diagnostic tests. Distributed diagnosis requires the exchange of information among all of these agents, but the agents may be operated by different people in different parts of the network and may not know about the existence or capabilities of the others. Such an environment creates several communication challenges for distributed diagnostic agents. When one agent communicates with another, it needs the ability to answer the following questions: "What are you saying?" (representation of diagnostic information), "What can you do?" (describing agent capabilities), and "What does it mean?" (incorporating new diagnostic information).

The CAPRI Architecture

Figure 1: CAPRI agents produce diagnostic information using component class definitions, dependency knowledge, and diagnostic information.

Figure 2: CAPRI agents can express observations and probabilistic beliefs about network component properties and relationships.

The Common Architecture for Probabilistic Reasoning in the Internet (CAPRI) enables distributed diagnostic agents to incorporate information from multiple sources to produce new diagnostic information in a modular and extensible way (Figure 1). CAPRI addresses the challenges of distributed fault diagnosis using the following techniques:

Representing component and diagnostic test classes and properties using an extensible, distributed component ontology, enabling agents to define new component classes and properties to support new components and new diagnostic tests. Using this ontology, agents can then exchange observations, beliefs, and dependency knowledge according to a message exchange protocol. Figure 2 illustrates how an agent can describe observations and probabilistic beliefs about various properties and relationships of network components.
Advertising diagnostic capabilities and looking up other diagnostic agents according to a common service description language, enabling agents to dynamically determine which agents to contact to obtain additional diagnostic information.
Providing a diagnosis procedure for incorporating new information received from other agents into a component graph, dynamically constructing a failure dependency graph from a component graph and a dependency knowledge base, and performing probabilistic inference with incomplete information. This procedure allows an agent to manage the long-term average probing and communication costs of diagnosis by aggregating data and requests and propagating diagnostic information according to an aggregation-friendly overlay topology. Figure 3 illustrates a failure dependency graph that an agent constructs after receiving the evidence from Figure 2 and incorporating information from a Verify DNS Lookup Test.
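The inference step in the diagnosis procedure above can be illustrated with a minimal sketch. It is not CAPRI's implementation: it assumes a toy model in which an HTTP connection fails exactly when at least one of four components has failed, uses invented prior failure probabilities, and computes posteriors by brute-force enumeration. The component names follow the failure categories used in the prototype; the function name and all numeric values are hypothetical.

```python
from itertools import product

# ASSUMPTION: illustrative prior failure probabilities, not values from CAPRI.
PRIOR_FAIL = {
    "local_network": 0.05,
    "dns_lookup": 0.10,
    "web_server": 0.10,
    "ip_routing": 0.02,
}


def posterior_failure(evidence):
    """Return P(component failed | HTTP connection failed, evidence).

    `evidence` maps a component name to True (observed failed) or
    False (observed working). Components without evidence stay unknown,
    so inference proceeds with incomplete information.
    """
    names = list(PRIOR_FAIL)
    total = 0.0
    fail_mass = {name: 0.0 for name in names}
    for states in product([False, True], repeat=len(names)):
        world = dict(zip(names, states))
        # Skip worlds that contradict the evidence.
        if any(world[name] != value for name, value in evidence.items()):
            continue
        # Toy dependency: the connection fails only if some component failed.
        if not any(states):
            continue
        p = 1.0
        for name in names:
            p *= PRIOR_FAIL[name] if world[name] else 1.0 - PRIOR_FAIL[name]
        total += p
        for name in names:
            if world[name]:
                fail_mass[name] += p
    if total == 0.0:
        raise ValueError("evidence is inconsistent with a connection failure")
    return {name: fail_mass[name] / total for name in names}


# A DNS lookup test reports that DNS is working; other components are unknown.
beliefs = posterior_failure({"dns_lookup": False})
most_likely = max(beliefs, key=beliefs.get)
print(most_likely, round(beliefs[most_likely], 3))
```

Enumeration is exponential in the number of components, so it only suits a toy graph like this one; it is used here because it makes the effect of new evidence on each belief easy to follow.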

Figure 3: CAPRI agents can dynamically construct failure dependency graphs to infer the most likely cause of failure.

Unlike previous research in network management, probabilistic inference, machine learning, and distributed systems, this thesis takes a broader, long-term, wide-area networking approach to distributed fault diagnosis that takes into account distributed information, extensibility, and scalability. CAPRI is based on the idea of a Knowledge Plane [1] for communicating and reasoning about high-level knowledge and management information in the Internet. CAPRI enables diagnosis among distributed agents with different specialties and capabilities in a general way not specific to the diagnosis of any particular type of failure. This modular decomposition encourages specialization, reuse, and composition. In addition, CAPRI is designed for extensibility to enable the addition of new diagnostic information, new agents, and new knowledge. Finally, CAPRI supports distributed diagnosis involving many users and agents by aggregating requests from many users and aggregating specialized services into more general services.

One of the greatest potential benefits of a common architecture for the exchange of diagnostic information is its ability to promote interaction among various types of diagnostic specialists to produce new diagnostic information and capabilities. By creating modular interfaces for communicating diagnostic information, CAPRI encourages specialization, reuse, information hiding, and composition of information. Certain agents may have special knowledge, technology, or resources for diagnosing a particular set of components. CAPRI enables such specialist agents to share their diagnostic capabilities with other agents. As technology improves and new diagnostic tests and methods become available, additional specialist agents may emerge. If the information that a specialist provides is useful, other agents may reuse the same information to diagnose many types of failures. Agents may also hide specialized information from others to help manage complexity and improve scalability, reducing the amount of information that other agents need to perform diagnosis. For example, an agent may advertise the capability to diagnose all types of DNS lookup failures without revealing information about all the other agents it needs to contact to perform that diagnosis. CAPRI also enables the composition of information from multiple agents to produce new information. For example, an agent for diagnosing HTTP connection failures might have no special capabilities of its own and simply combine observations, beliefs, and dependency knowledge from other agents to produce its diagnosis. Like the online encyclopedia Wikipedia, which brings together distributed specialists to share information and build upon each other's knowledge to produce a shared repository of valuable information, CAPRI provides a framework for specialist agents to contribute their diagnostic capabilities and build upon the capabilities of other agents.
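The advertise-and-lookup pattern described above can be sketched as follows. This is an illustration only: CAPRI defines its own service description language, which is not reproduced here, and the `Capability` and `Directory` names, their fields, and the agent identifiers are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    agent: str            # hypothetical agent identifier
    component_class: str  # class of component the agent can diagnose
    provides: str         # "observation", "belief", or "knowledge"


class Directory:
    """Toy in-memory directory; a stand-in for a real directory service."""

    def __init__(self):
        self._by_class = defaultdict(list)

    def advertise(self, capability):
        self._by_class[capability.component_class].append(capability)

    def lookup(self, component_class, provides=None):
        """Return the agents advertising a capability for a component class."""
        return [
            cap.agent
            for cap in self._by_class.get(component_class, [])
            if provides is None or cap.provides == provides
        ]


directory = Directory()
# The DNS agent advertises one capability and hides which helper agents
# it contacts internally (information hiding).
directory.advertise(Capability("dns-agent-1", "DNS Lookup", "belief"))
directory.advertise(Capability("history-agent-1", "Web Server", "belief"))
directory.advertise(Capability("knowledge-agent", "HTTP Connection", "knowledge"))

# An HTTP diagnosis agent with no special capabilities of its own composes
# a diagnosis by finding whom to ask for each component class.
for component_class in ("DNS Lookup", "Web Server", "IP Routing"):
    print(component_class, directory.lookup(component_class))
```

A lookup that returns an empty list, as for "IP Routing" here, simply means no agent has advertised that capability yet, so the requester must diagnose with incomplete information.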

Results

To evaluate the effectiveness of CAPRI, I designed, implemented, and deployed a range of prototype diagnostic agents for diagnosing HTTP connection failures in the Internet. Diagnostic agents in this prototype identify the cause of HTTP connection failures in terms of the user's local network status, DNS lookup status, web server status, and IP route status. The ServStats Firefox extension acts as a local diagnostic agent for end users and can request diagnosis from a regional agent that runs on PlanetLab. Regional agents communicate with DNS Lookup specialist agents, Web Server History specialist agents, AS Path specialist agents, and a Dependency Knowledge agent. Agents dynamically advertise and look up diagnostic capabilities using a directory service. Currently this prototype implementation includes about 2500 user agents, 15 regional agents, 4 Web Server History agents, 3 DNS Lookup agents, 14 AS Path agents, and one Dependency Knowledge agent. User agents detect HTTP connection failures in the Firefox web browser and can perform local tests or request additional diagnosis from regional agents. User agents also send information about past HTTP connections to regional agents. Regional agents then forward HTTP connection notifications from user agents to Web Server History agents in order to compute aggregate statistics on web servers useful for diagnosis. User and regional agents in this prototype produce approximately 10,000 failure diagnoses per day. Of these failures, approximately 14% are local network failures, 39% are web server failures, 32% are DNS lookup failures, 3% are IP routing failures, and 12% are unknown.

Diagnostic agents in CAPRI can effectively use the procedure for fault diagnosis to reduce the number of diagnostic requests they generate. Over a three-day period, user agents observed 14,509 failures. The user agents diagnosed about 53% of these using only local tests without requesting any additional information from other agents. The user agents requested additional diagnosis from regional agents for the remaining 7690 failures. The 15 regional agents in turn selectively requested additional information from the three types of specialist agents as necessary for diagnosis, producing 7281 belief requests to server history agents, 5360 requests to DNS lookup agents, and 1872 requests to AS path agents. Compared to all agents always requesting information from all available tests for each failure, diagnostic agents in CAPRI perform only 38% as many requests.
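The 38% figure can be checked with a short calculation. The request counts below are taken from the paragraph above; the baseline is an assumption, since the text does not spell it out: every failure is taken to trigger one request to a regional agent plus one request to each of the three specialist agent types.

```python
# Counts reported for the three-day period.
failures = 14_509
regional_requests = 7_690
specialist_requests = {
    "server_history": 7_281,
    "dns_lookup": 5_360,
    "as_path": 1_872,
}

# Requests actually made under selective diagnosis.
actual = regional_requests + sum(specialist_requests.values())

# ASSUMPTION: in the "always request everything" baseline, each failure causes
# one regional request and one request per specialist type (four in total).
baseline = failures * (1 + len(specialist_requests))

ratio = actual / baseline
print(f"{actual} of {baseline} requests = {ratio:.0%}")
```

Under this assumed baseline the ratio comes to about 38%, which agrees with the reported figure.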

Figure 4: Learning improves accuracy and precision.

In addition, the probabilistic model of fault diagnosis that CAPRI agents use enables the learning of new dependency knowledge. Initially, regional agents diagnose failures using manually specified probabilistic dependency knowledge, resulting in an overall accuracy of 70%. Using dependency knowledge learned from previous failure observations, however, agents were able to diagnose failures with 95% accuracy (Figure 4).
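One simple way to learn probabilistic dependency knowledge from past observations is to estimate conditional probabilities by counting. The sketch below is a stand-in for illustration and is not necessarily the learning method CAPRI agents use: it applies frequency counting with Laplace smoothing, and the function name, the state labels, and the sample data are all hypothetical.

```python
from collections import Counter


def learn_cpt(observations, alpha=1.0):
    """Estimate P(child failed | parent state) from past observations.

    `observations` is an iterable of (parent_state, child_failed) pairs.
    `alpha` is the Laplace smoothing constant, which keeps estimates away
    from 0 and 1 when only a few observations are available.
    """
    fails = Counter()
    totals = Counter()
    for parent_state, child_failed in observations:
        totals[parent_state] += 1
        if child_failed:
            fails[parent_state] += 1
    # Two outcomes (failed or not), hence 2 * alpha in the denominator.
    return {
        state: (fails[state] + alpha) / (totals[state] + 2 * alpha)
        for state in totals
    }


# Hypothetical history: DNS lookup status paired with whether the
# HTTP connection failed.
history = (
    [("dns_ok", False)] * 8
    + [("dns_ok", True)] * 2
    + [("dns_failed", True)] * 3
)
cpt = learn_cpt(history)
print(cpt)
```

The learned table could then replace manually specified probabilities in a failure dependency graph, which is the kind of substitution the accuracy comparison above describes.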

References:

[1] David D. Clark, Craig Partridge, J. Christopher Ramming, and John T. Wroclawski. A Knowledge Plane for the Internet. In Proceedings of SIGCOMM '03, 2003.

[2] George J. Lee. CAPRI: A Common Architecture for Autonomous, Distributed Diagnosis of Internet Faults using Probabilistic Relational Models. In Proceedings of the First Workshop on Hot Topics in Autonomic Computing (HotAC I), Dublin, Ireland, June 2006.
