Abstracts - 2006
Fjalar: A Dynamic Analysis Framework for C and C++ Programs
Philip J. Guo, Stephen McCamant & Michael D. Ernst
Fjalar is a framework that facilitates the construction of dynamic analysis tools for programs written in C and C++. It is often difficult to build robust and scalable dynamic analyses for C and C++ programs because of the lack of memory and type safety in these languages. For instance, the run time system does not keep track of array sizes or whether values have been initialized. Existing frameworks based on source-to-source transformation often suffer from scalability and robustness problems due to the difficulty of adding instrumentation source code to track memory usage and initialization. Frameworks that operate purely at the binary level cannot provide rich language-level information about data structures that are useful for many kinds of analyses. Fjalar combines aspects of both source- and binary-based approaches and can be used to build tools that safely access rich information at both the language and machine levels during execution.
Fjalar-based tools dynamically instrument un-modified x86/Linux executables compiled with DWARF2 debugging information. The ability to operate on executables rather than on source code places less burden on the tool's users, because there is no need to deal with complex configuration and Makefile options to determine which source files to instrument. It also achieves greater scalability by not having to deal with the difficulties of parsing and instrumenting complex C and C++ source code constructs. We have built two dynamic analysis tools using Fjalar: Kvasir, a value profiling tool, and DynComp, an abstract type inference tool. These Fjalar tools work on executables of C and C++ programs of more than 100,000 lines of code, including gcc, xemacs, Apache, and CTAS.
We developed the Fjalar framework to facilitate the construction of two dynamic program analysis tools in our research group:
The first tool, Kvasir, records rich run-time values of variables during program execution. This was developed to fulfill the need for a scalable and robust C/C++ front-end for the Daikon dynamic invariant detector . Daikon is a language-independent tool and requires separate language-specific front-ends (also available for Java, Perl, and some other languages) to provide it with traces of values which it can analyze to infer invariants. Kvasir was developed as a replacement for our old C/C++ front-end, named Dfec , which was based on a source-rewriting technique. Dfec had major scalability limitations, many caused by the complexities of dealing with C/C++ source code, which prevented it from working on programs of significant size or complexity. In order to overcome Dfec's limitations, Kvasir was designed using a mixed-level approach (combining source and binary approaches) and implemented atop the Fjalar framework.
The second tool, DynComp, performs dynamic inference of abstract types. Programmers often use the same declared type (e.g., int) to represent conceptually distinct values. For instance, one int variable might hold an array index while another int variable might hold the temperature in degrees Celsius. We call these types that the programmer intends to represent abstract types, and DynComp aims to infer those by observing the interactions of values during program execution. This analysis is useful for program understanding, bug finding, debugging, abstraction violation detection, and for making other program analyses more efficient.
Traditional dynamic analyses have either worked at the source level by performing source code analysis/rewriting (e.g. Dfec) or at the binary level by performing binary rewriting (e.g. memory-safety tools such as Purify ). However, neither approach is optimal for the analyses performed by Fjalar-based tools. The freedom that C and C++ give programmers to control the contents of memory makes it challenging for analysis tools to determine which regions are safe to dereference, which variables are initialized, and what they refer to. In order to address these challenges, we use a mixed-level approach which combines some of the key advantages of both source- and binary-based techniques. Fjalar is implemented via such a mixed-level approach which combines binary rewriting using Valgrind, memory safety enforcement using Valgrind Memcheck, and the use of source-level information from compiler-generated debugging information.
Fjalar performs binary rewriting of the target program (the program being analyzed) using the Valgrind framework , which allows analysis tools to supervise the execution of Linux programs on a virtual x86 processor. At arbitrary program points (by default procedure entry and exit points) it pauses the target program's execution and allows a tool to observe the current values of data structures and perform processing based upon this information.
Memory safety is crucial when observing program values in memory because C and C++ are relatively low-level languages that allow the program to manipulate memory contents in an unrestricted manner. For instance, a local variable declared as an 'int*' might point to an unallocated region of memory, an uninitialized (garbage) value, and even if it were allocated and initialized, could refer either to a single integer or to an array of any size. Fjalar must not attempt to dereference pointers which refer to unallocated regions of memory (because that might cause the program to crash) and Fjalar-based tools must be able to avoid processing uninitialized data (because they are semantically meaningless). Fjalar uses the Valgrind Memcheck tool to provide these memory safety guarantees. Memcheck works by keeping along with each byte of memory one A-bit (set when the byte has been allocated) and eight V-bits (set when the respective bit has been initialized with valid data). Given information about a variable and its address in memory, Fjalar can use Memcheck to figure out whether that location has been allocated (querying A-bits), and if so, whether its contents have been initialized by the program (querying V-bits).
The final component in our mixed-level approach for Fjalar is its use of debugging information as a surrogate for source code to obtain language-level information about variables, types, and functions. Debugging information is much easier to parse than source code and is relatively language-independent: for instance, the GCC family of compilers generates debugging symbols (according to the DWARF standard) for all the languages they support, including C, C++, Fortran, and Ada. Fjalar uses debugging information to map between data it collects at the memory level and language-level terms such as variable and function names used by higher-level tools.
The Kvasir tool has been under development since February 2004 and was the initial motivation for the development of Fjalar. We developed the mixed-level approach while implementing Kvasir and realized that it could be generalized into a framework for building a variety of dynamic analysis tools. In 2005, we developed the Fjalar framework and used it to build a newer version of Kvasir as well as the DynComp tool.
The table above shows some C and C++ programs on which Kvasir and DynComp have run successfully. We have applied Kvasir and DynComp on civserver, BIND, Apache, and CTAS as part of a run-time data structure repair system. We used those tools along with Daikon to generate data structure consistency constraints, and the repair tool enforced those constraints to prevent crashes caused by malformed input and malicious attacks.
As a reference, our previous source-based C/C++ front-end for Daikon, Dfec, could only run on programs as large as the smallest one listed on this table and often required extensive modifications to the target program's source before it could work. By contrast, Kvasir worked without any changes to most of the programs tested. There would be no way that we could have applied Daikon on complex, realistic C and C++ programs (like those used in the data structure repair experiments) using the old Dfec front-end.
Fjalar inherits the operating system and architecture limitations of Valgrind, but while the latest version of Valgrind supports Linux on x86, AMD-64, and PowerPC processors, Fjalar has only been developed so far for x86-Linux. A future goal is to enhance Fjalar to work on the AMD-64 and PowerPC architectures, and to track in-progress ports of Valgrind to FreeBSD and Mac OS X. We also aim to improve Fjalar's C++ support and test it more extensively on more real-world C++ programs. Finally we are looking for new (or existing) dynamic analyses that can be implemented using the flexible framework that Fjalar provides.
This research is supported by DARPA and a National Defense Science and Engineering Graduate Fellowship.
 Michael D. Ernst, Jake Cockrell, William G. Griswold, and David Notkin. Dynamically discovering likely program invariants to support program evolution. IEEE Transactions on Software Engineering, vol. 27, no. 2, February 2001, pp. 1-25.
 Reed Hastings and Bob Joyce. Purify: A tool for detecting memory leaks and access errors in C and C++ programs. In Proceedings of the Winter 1992 USENIX Conference, pp. 125--138, San Francisco, California, January 1992.
 Benjamin Morse. A C/C++ front end for the Daikon dynamic invariant detection system. Master's thesis, MIT Department of Electrical Engineering and Computer Science, Cambridge, Massachusetts, USA, August 2002.
 Nicholas Nethercote and Julian Seward. Valgrind: A program supervision framework. In Proceedings of the Third Workshop on Runtime Verification, Boulder, Colorado, USA, July 2003.