MIT CSAIL Research Abstracts

CSAIL Publications and Digital Archive header

Technical Reports

Work Products

Research Abstracts

Historical Collections

horizontal line

Research Abstracts - 2006
horizontal line

horizontal line

Dynamic Inference of Abstract Types

Philip J. Guo, Jeff H. Perkins, Stephen McCamant & Michael D. Ernst

What

An abstract type groups variables that are used for a related purpose in a program. We describe a dynamic unification-based analysis for inferring abstract types. Initially, each run-time value gets a unique abstract type. A run-time interaction among values indicates that they have the same abstract type, so their abstract types are unified. Also at run time, abstract types for variables are accumulated from abstract types for values. The notion of interaction may be customized, permitting the analysis to compute finer or coarser abstract types. We have implemented the analysis for compiled x86 binaries and for Java bytecodes. Our experiments indicate that the inferred abstract types are useful for program comprehension, improve both the run time and the precision of a follow-on program analysis, and are more precise than the output of a comparable static analysis, without suffering from overfitting.

Languages like C and Java provide programmers with only a few basic numeric types (int, float, etc.). Thus, programmers often use a single declared type (such as int) to hold data of multiple distinct abstract types. Consider the following simple example:

int main() {
  int dist = 3000;
  int itemPrice = 50;
  int taxPrice = 3;
  totalCost (dist, itemPrice, taxPrice);
}

int totalCost (int d, int base, int tax) {
  int year = 2006;
  if (d > 1000) {
    int shippingFee = 10;
    return base + tax + shippingFee;
  }
  else {
    return base + tax;
  }
}

int main() {
  Distance dist = 3000;
  Money itemPrice = 50;
  Money taxPrice = 3;
  totalCost (dist, itemPrice, taxPrice);
}

Money totalCost (Distance d, Money base,
                 Money tax) {
  Year year = 2006;
  if (d > 1000) {
    Money shippingFee = 10;
    return base + tax + shippingFee;
  }
  else {
    return base + tax;
  }
}

The three variables in main() all have the same declared type, int, but two of them hold related quantities (amounts of money) as can be determined by the fact that the program adds them together within totalCost(), whereas the other holds a conceptually distinct quantity (a distance). Thus itemPrice and taxPrice have the same abstract type while dist has a separate abstract type.

Our analysis automatically determines when two variables hold values that are of the same abstract type without the need for manual annotations by the programmer. Given a program like the example on the left, it infers abstract types shown in colors (albeit without type names). The example on the right shows the more precise types that the analysis infers, in the form of type annotations in a hypothetical language.

Why

User studies show that abstract type information helps programmers understand their code better. It can be useful to a programmer in identifying abstract data types, and in locating sites of possible references to values. It can also help detect violations of the programmer's intent, such as when unexpected variables are found to have the same abstract type: this potentially indicates an abstraction violation.

In addition to helping programmers directly, another application of abstract types is to improve the efficiency and/or precision of a later program analysis An example is the Daikon dynamic invariant detection system [2], which analyzes program value traces to infer properties that hold over all observed executions. Without abstract type information, Daikon attempts to infer invariants over all sets of variables with the same declared type (e.g. int), which is expensive and produces invariants that are not meaningful. In the above example, Daikon may infer that d > tax; while this property is true, it is not likely to be interesting because it makes no sense to compare a distance and an amount of money. Abstract type information indicates which pairs of variables should be analyzed for potential invariants, which both improves Daikon's performance and helps it produce more relevant and less confusing output.

How

A previous approach [1] performed a static analysis to determine when two variables hold the same abstract type. Unfortunately, static analysis has several drawbacks. It does not scale easily to large or complex programs. It is also inherently imprecise; for instance, static analyses are generally unable to distinguish values that have been stored in and then retrieved from a single array.

Our approach is to use dynamic analysis to track values at run time. Dynamic analysis can scale well, because its running time is closely related to the program's running time, and it can avoid many language complexities that exist only at the source-code level. Its results are more precise because it need not apply approximations of run-time behavior, but can observe actual behavior.

Our analysis gives two values the same abstract type if they interact transitively by a series of operations. Values related by assignment are always in the same abstract type; whether other operations are counted as interaction can be configured, for instance to match the requirements of a subsequent tool. The analysis associates a fresh abstract type with each new value; this is implemented as a unique identifier that tags the value. (For a primitive representation type such as int, new values are dynamic instances of literals, values read from a file, etc.) Each operation on two values unifies their abstract types using an efficient union-find data structure. Two variables have the same abstract types if their values do.

The analysis reports, for any pair of variables at a given program point (typically procedure entries and exits), whether those variables ever held values of the same abstract type at that program point. Since variables hold different values as execution progresses, the abstract type of the value held by a variables must be recorded each time a program point is executed. In order to accommodate this, the analysis keeps a second variable-based set of abstract type information (independently for each program point) and merges the value-based information into that data structure at each execution of the program point.

Progress

We have developed two tools that implement our dynamic abstract type inference analysis: DynCompJ for Java and DynCompX for x86 executables of languages such as C and C++. DynCompJ is implemented by instrumenting the Java bytecode. DynCompX is implemented on top of the Fjalar dynamic analysis framework, which uses dynamic binary instrumentation.

In a pair of user studies, two developers outside our research group used the results of DynCompX to assist them in C code maintenance tasks. One researcher was trying to understand a 5843-line support vector machine implementation. Examining the results of DynCompX in one case confirmed his intuitions about the correspondence between the code and the algorithm it implements, and in another alerted him that the use of a global variable for temporary values was more prevalent than he had previously believed. A second researcher was a computational biologist who had recently refactored a 1804-line RNA folding simulation to convert 55 variables of abstract type "energy" from C type int to double. After seeing the results of running DynCompX on the code with a single test case, he estimated that having this information available before refactoring would have reduced the time required to perform it from 2 days to 1-2 hours.

We have also run DynCompJ and DynCompX on mid-sized Java and C programs (up to 100,000 lines of code), including bzip2, flex, perl and the comp package of javac. The abstract types generated from these runs allow Daikon to take less time and memory (for instance, by a factor of 5 for flex) and eliminate many spurious invariants.

Research Support

This research is supported by DARPA and a National Defense Science and Engineering Graduate Fellowship.

References:

[1] Robert O'Callahan and Daniel Jackson. Lackwit: A program understanding tool based on type inference. In Proceedings of the 19th International Conference on Software Engineering, pages 338-348, Boston, Massachusetts, May 1997.

[2] Michael D. Ernst, Adam Czeisler, William G. Griswold, and David Notkin. Quickly detecting relevant program invariants. In ICSE 2000, Proceedings of the 22nd International Conference on Software Engineering, pages 449-458, Limerick, Ireland, June 2000.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu