Ubiquitous Memory Introspection
Qin Zhao, Rodric Rabbah, Saman Amarasinghe, Larry Rudolph,
and Weng-Fai Wong
Abstract
Modern memory systems play a critical role in the performance of applications,
but a detailed understanding of the application behavior in the memory
system is not trivial to attain. It requires time-consuming simulations
and detailed modeling of the memory hierarchy, often using long address
traces. It is increasingly possible to access hardware performance counters
to count relevant events in the memory system, but the measurements are
coarse-grained and better suited for performance summaries than providing
instruction-level feedback. The availability of a low-cost, online, and
accurate methodology for deriving fine-grained memory behavior profiles
can prove extremely useful for runtime analysis and optimization of programs.
This work introduces a new methodology for Ubiquitous Memory Introspection
(UMI). It is an online and lightweight methodology that uses fast mini-simulations
to analyze short memory access traces recorded from frequently executed
code regions. The simulations provide profiling results at varying granularities,
down to that of a single instruction or address. UMI naturally complements
runtime optimizations and enables new opportunities for online
memory-specific optimizations. We have developed a prototype runtime system implementing
UMI which is readily deployed on commodity processors, requires no user
intervention, and can operate with stripped binaries and legacy software.
The prototype has an average runtime overhead of 14 percent. This overhead
is only 1 percent more than a state-of-the-art binary instrumentation
tool. We used 32 benchmarks, including the full suite of SPEC CPU2000
benchmarks, for evaluation. We show that the mini-simulations accurately
reflect the cache performance of two existing memory systems, an Intel
Pentium 4 and an AMD Athlon MP (K7). We also demonstrate that UMI predicts
delinquent load instructions with an 88 percent rate of accuracy for applications
with a relatively high number of cache misses, and 61 percent overall.
The online profiling results are used at runtime to implement a simple
software prefetching strategy that achieves an overall speedup of 64 percent
in the best case. UMI can also be used in the context of an online software
prefetcher. In many cases, our online software prefetcher can match the
performance of the hardware prefetcher available in the Pentium 4, and
in some cases, outperforms it. In the case of the AMD K7, UMI and software
prefetching deliver an 11 percent overall performance gain. This kind
of information is available only through exhaustive simulation.
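
The idea of a mini-simulation over a short access trace can be illustrated
with a minimal sketch. This is not the UMI prototype: the function names
(mini_simulate, delinquent_loads), the cache geometry, and the miss-ratio
threshold are all assumptions chosen for illustration. The sketch replays a
short trace of (instruction address, data address) pairs through a small
set-associative LRU cache model and reports hits and misses per instruction,
so that loads missing often can be flagged as candidates for prefetching.

```python
from collections import OrderedDict, defaultdict


def mini_simulate(trace, num_sets=64, ways=4, line_size=64):
    """Replay (pc, address) pairs through a small set-associative LRU cache.

    The geometry defaults are illustrative assumptions, not the values
    used by the UMI prototype. Returns {pc: (accesses, misses)}.
    """
    # One LRU-ordered dict of resident cache lines per set.
    sets = [OrderedDict() for _ in range(num_sets)]
    stats = defaultdict(lambda: [0, 0])
    for pc, addr in trace:
        line = addr // line_size
        cache_set = sets[line % num_sets]
        stats[pc][0] += 1
        if line in cache_set:
            cache_set.move_to_end(line)        # hit: mark most recently used
        else:
            stats[pc][1] += 1                  # miss: charge it to this pc
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict least recently used
            cache_set[line] = True
    return {pc: (acc, miss) for pc, (acc, miss) in stats.items()}


def delinquent_loads(stats, miss_ratio_threshold=0.5, min_accesses=8):
    """Flag instructions whose simulated miss ratio meets a threshold.

    Both cutoffs are hypothetical; a real system would tune them.
    """
    return sorted(
        pc
        for pc, (acc, miss) in stats.items()
        if acc >= min_accesses and miss / acc >= miss_ratio_threshold
    )


# Hypothetical trace: the load at 0x400 streams through memory one cache
# line at a time, while the load at 0x404 keeps reusing a single address.
trace = []
for i in range(32):
    trace.append((0x400, 0x10000 + i * 64))
    trace.append((0x404, 0x8000))

stats = mini_simulate(trace)
print(stats)
print([hex(pc) for pc in delinquent_loads(stats)])
```

In this toy trace the streaming load misses on every access and is flagged,
while the reused load misses only once. Keeping the trace short and the
cache model small is what keeps such a simulation cheap enough to run online.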
References:
[1] Ubiquitous Memory Introspection. In Proceedings of the 2007 International
Symposium on Code Generation and Optimization (CGO), San Jose, CA,
March 2007.