Abstracts - 2006
Dynamic Inference of Abstract Types
Philip J. Guo, Jeff H. Perkins, Stephen McCamant & Michael D. Ernst
An abstract type groups variables that are used for a related purpose in a program. We describe a dynamic unification-based analysis for inferring abstract types. Initially, each run-time value gets a unique abstract type. A run-time interaction among values indicates that they have the same abstract type, so their abstract types are unified. Also at run time, abstract types for variables are accumulated from abstract types for values. The notion of interaction may be customized, permitting the analysis to compute finer or coarser abstract types. We have implemented the analysis for compiled x86 binaries and for Java bytecodes. Our experiments indicate that the inferred abstract types are useful for program comprehension, improve both the run time and the precision of a follow-on program analysis, and are more precise than the output of a comparable static analysis, without suffering from overfitting.
Languages like C and Java provide programmers with only a few basic
numeric types (
The three variables in
Our analysis automatically determines when two variables hold values that are of the same abstract type without the need for manual annotations by the programmer. Given a program like the example on the left, it infers abstract types shown in colors (albeit without type names). The example on the right shows the more precise types that the analysis infers, in the form of type annotations in a hypothetical language.
User studies show that abstract type information helps programmers understand their code better. It can be useful to a programmer in identifying abstract data types, and in locating sites of possible references to values. It can also help detect violations of the programmer's intent, such as when unexpected variables are found to have the same abstract type: this potentially indicates an abstraction violation.
In addition to helping programmers directly, another application of
abstract types is to improve the efficiency and/or precision of a later
program analysis An example is the Daikon
dynamic invariant detection system , which analyzes program value traces
to infer properties that hold over all observed executions. Without abstract
type information, Daikon attempts to infer invariants over all sets of
variables with the same declared type (e.g.
A previous approach  performed a static analysis to determine when two variables hold the same abstract type. Unfortunately, static analysis has several drawbacks. It does not scale easily to large or complex programs. It is also inherently imprecise; for instance, static analyses are generally unable to distinguish values that have been stored in and then retrieved from a single array.
Our approach is to use dynamic analysis to track values at run time. Dynamic analysis can scale well, because its running time is closely related to the program's running time, and it can avoid many language complexities that exist only at the source-code level. Its results are more precise because it need not apply approximations of run-time behavior, but can observe actual behavior.
Our analysis gives two values the same abstract type if they interact
transitively by a series of operations. Values related by assignment are
always in the same abstract type; whether other operations are counted
as interaction can be configured, for instance to match the requirements
of a subsequent tool. The analysis associates a fresh abstract type with
each new value; this is implemented as a unique identifier that tags the
value. (For a primitive representation type such as
The analysis reports, for any pair of variables at a given program point (typically procedure entries and exits), whether those variables ever held values of the same abstract type at that program point. Since variables hold different values as execution progresses, the abstract type of the value held by a variables must be recorded each time a program point is executed. In order to accommodate this, the analysis keeps a second variable-based set of abstract type information (independently for each program point) and merges the value-based information into that data structure at each execution of the program point.
We have developed two tools that implement our dynamic abstract type inference analysis: DynCompJ for Java and DynCompX for x86 executables of languages such as C and C++. DynCompJ is implemented by instrumenting the Java bytecode. DynCompX is implemented on top of the Fjalar dynamic analysis framework, which uses dynamic binary instrumentation.
In a pair of user studies, two developers outside our research group used the results of DynCompX to assist them in C code maintenance tasks. One researcher was trying to understand a 5843-line support vector machine implementation. Examining the results of DynCompX in one case confirmed his intuitions about the correspondence between the code and the algorithm it implements, and in another alerted him that the use of a global variable for temporary values was more prevalent than he had previously believed. A second researcher was a computational biologist who had recently refactored a 1804-line RNA folding simulation to convert 55 variables of abstract type "energy" from C type int to double. After seeing the results of running DynCompX on the code with a single test case, he estimated that having this information available before refactoring would have reduced the time required to perform it from 2 days to 1-2 hours.
We have also run DynCompJ and DynCompX on mid-sized Java and C programs (up to 100,000 lines of code), including bzip2, flex, perl and the comp package of javac. The abstract types generated from these runs allow Daikon to take less time and memory (for instance, by a factor of 5 for flex) and eliminate many spurious invariants.
This research is supported by DARPA and a National Defense Science and Engineering Graduate Fellowship.
 Robert O'Callahan and Daniel Jackson. Lackwit: A program understanding tool based on type inference. In Proceedings of the 19th International Conference on Software Engineering, pages 338-348, Boston, Massachusetts, May 1997.
 Michael D. Ernst, Adam Czeisler, William G. Griswold, and David Notkin. Quickly detecting relevant program invariants. In ICSE 2000, Proceedings of the 22nd International Conference on Software Engineering, pages 449-458, Limerick, Ireland, June 2000.