Scalable C/C++ Value Profiling Via Mixed-Level AnalysisPhilip J. Guo, Stephen McCamant & Michael D. ErnstWhatWe are continuing the development of a tool to perform value profiling of C and C++ programs. Value profiling is a technique to observe the run-time values of variables and expressions in a program. Such examination is key to any dynamic analysis which attempts to understand a program by watching its computations as they occur. One example application is the Daikon dynamic invariant detector [1], which uses traces of variable values to infer program properties that are likely to be invariants. Our tool, named Kvasir (after the Norse god of knowledge and beet juice), serves as a C/C++ front-end for Daikon by executing programs, observing the values of variables at procedure entrance and exit points, and outputting the data to a trace file for Daikon to analyze. Experimental evaluation has shown that Kvasir is quite robust and scalable, working on a wide range of large open-source Linux C/C++ programs including XEmacs (205,000 lines of code), and GCC (302,000 lines). WhyKvasir was developed to fulfill the need for a scalable and robust C/C++ front-end for Daikon. Daikon is language-independent and requires separate language-specific front-ends (also available for Java, Perl, and some other languages) to provide it with traces of values which it can analyze to infer invariants. Kvasir was developed as a replacement for our old C/C++ front-end, named Dfec [3], which was based on a source-rewriting technique. Dfec had major scalability limitations, many caused by the complexities of dealing with C/C++ source code, which prevented it from working on any programs of significant size or complexity. In order to overcome Dfec's limitations, Kvasir was designed using a mixed-level approach which combines binary translation for program observation with debugging information for mapping memory-level information to language-level terms to present as output. HowTraditional dynamic analyses have either worked at the source level by performing source code analysis/rewriting (e.g. Dfec) or at the binary level by performing binary rewriting (e.g. memory-safety tools such as Purify [2]). We feel that neither approach is optimal for the domain of analyses to which value profiling belongs. The freedom that C and C++ give to programmers to control the contents of memory makes it challenging for value profiling tools to determine which regions are safe to dereference, which variables are initialized, and what they refer to. In order to address these challenges, we propose a mixed-level approach which combines some of the key advantages of both source- and binary-based techniques. Kvasir is implemented via such a mixed-level approach which combines binary rewriting using Valgrind, memory safety enforcement using Valgrind Memcheck, and the use of source-level information from compiler-generated debugging information. Kvasir performs binary rewriting of the target program (the program being analyzed) using the Valgrind framework [4], which allows analysis tools to supervise the execution of Linux programs on a virtual x86 processor. It inserts code at each procedure entry and exit point (called program points) to pause the target program's execution. While it has taken control from the program, Kvasir observes the contents of memory in order to obtain variable values and records these values to a trace file which Daikon can later process. Memory safety is crucial when performing value profiling because C and C++ are relatively low-level languages that allow the program to manipulate memory contents in an unrestricted manner. For instance, a local variable declared as an 'int*' might point to an unallocated region of memory, an uninitialized (garbage) value, and even if it were allocated and initialized, could refer either to a single integer or to an array of any size. Kvasir must not attempt to dereference pointers which refer to unallocated regions of memory (which cause the program to crash) nor output uninitialized data to the trace file (because they are semantically meaningless). Kvasir uses the Valgrind Memcheck tool to provide these memory safety guarantees. Memcheck works by keeping along with each byte of memory 1 A-bit (only set when the byte has been allocated) and 8 V-bits (only set when the respective bit has been initialized with valid data). Given information about a variable and its address in memory, Kvasir can use Memcheck to figure out whether that location has been allocated (querying A-bits), and if so, whether its contents have been initialized by the program (querying V-bits). The final component in our mixed-level approach for Kvasir is its use of debugging information as a surrogate for source code to obtain language-level information about variables, types, and functions. Debugging information is much easier to parse than source code and is relatively language-independent: for instance, the GCC family of compilers generates similar debugging symbols (according to the DWARF standard) for all the languages it supports, including C, C++, Fortran, Ada, and others. Kvasir utilizes debugging information to map between data it collects at the memory level and language-level terms such as variable and function names which it reports to the trace file. Progress
Kvasir has been under development since February 2004 and is now fairly robust and scalable. The table above shows some programs on which it has run successfully. Our previous C/C++ front-end, Dfec, could only run on programs as large as the smallest one listed on this table, but often required extensive modifications to the target program's source before it could work. By contrast, Kvasir worked without any changes to most of the programs tested. On the two largest programs, small changes were required: for xemacs, it was necessary to copy a file and change the Makefile to disambiguate static functions from the same file compiled twice, and for gcc, to specify a configuration option to choose a malloc-based garbage collector. FutureBecause Kvasir is built upon the Valgrind framework, it has the same operating system and architecture limitations as Valgrind itself: the current version works only with x86 processors under Linux. However, new versions which will work on more operating systems (FreeBSD) and platforms (PowerPC, AMD64) are currently under development. One future goal is to move Kvasir to these newer versions of Valgrind to improve its portability. Kvasir's current C++ support is sufficient for many purposes, but lacks some desirable features, such as tracing member function definitions inside a class body, and special treatment for the Standard Template Library. We will consider such features based on demand as our community of C++ users grows. Finally, we are applying the same mixed-level approach to the problem of dynamic comparability analysis of C and C++ programs; the results of such an analysis will complement Kvasir's output to improve the performance and scalability of Daikon. Research SupportThis research was supported by NSF grants CCR-0133580 and CCR-0234651, IBM, Edison Design Group, and an NSF graduate fellowship. References:[1] Michael D. Ernst, Jake Cockrell, William G. Griswold, and David Notkin. Dynamically discovering likely program invariants to support program evolution. IEEE Transactions on Software Engineering, vol. 27, no. 2, February 2001, pp. 1-25. [2] Reed Hastings and Bob Joyce. Purify: A tool for detecting memory leaks and access errors in C and C++ programs. In Proceedings of the Winter 1992 USENIX Conference, pp. 125--138, San Francisco, California, January 1992. [3] Benjamin Morse. A C/C++ front end for the Daikon dynamic invariant detection system. Master's thesis, MIT Department of Electrical Engineering and Computer Science, Cambridge, Massachusetts, USA, August 2002. [4] Nicholas Nethercote and Julian Seward. Valgrind: A program supervision framework. In Proceedings of the Third Workshop on Runtime Verification, Boulder, Colorado, USA, July 2003. |
|||||||||||||||||||||||||||||||||||||||||||||
|