Download your copy of the MRRL enhancements, which enable Memory Reference Reuse Latency within SimpleScalar (ss3b_mrrl-0.0.1.tar.gz), by John Haskins, Jr. and Kevin Skadron, today!


What is Memory Reference Reuse Latency?

Memory Reference Reuse Latency (MRRL) refers to the distance (in completed instructions) between consecutive references to some memory location M[A]. For example, if M[A] is referenced at instruction 1,000 and next referenced at instruction 5,200, that pair of references has a reuse latency of 4,200 instructions. By measuring the reuse latencies of each unique address accessed by a benchmark, we were able to select a point at which to begin cache and branch predictor warm up prior to each simulation sample cluster. Cache and branch predictor warm up ensures accurate simulation; our delayed warm up technique achieves accurate simulation in less time than modeling all cache and branch predictor interactions prior to each sample cluster. For a more in-depth description, please refer to our technical report here and our ISPASS 2003 paper here.

What's the big picture and how well does MRRL work?

Very briefly... MRRL works very well for sampled simulation, because MRRL-driven warm up eliminates nonsampling bias just as well as fullwarmup, which models all pre-cluster instructions, while taking substantially less time.

Less briefly... Conte et al. showed that random cluster sampling is an appropriate technique for microarchitecture simulation that is amenable to statistical analysis. In random cluster sampling, a sample is drawn by choosing clusters of a fixed number of instructions from a benchmark's dynamic instruction stream such that no region of the end-to-end execution is more likely than any other to be included in the sample. Only these clusters are simulated in cycle-accurate detail.

For a well-chosen sample, simulation accuracy depends upon minimizing nonsampling bias, which is accomplished by accurately establishing state within the simulated cache hierarchy and branch predictor prior to each cycle-accurate cluster simulation. The most straightforward warm up technique models all pre-cluster cache and branch predictor interactions. Fully warming up all cache and branch predictor interactions guarantees perfect state; hence, this approach is impervious to nonsampling bias. It is time-consuming, however, since cache and branch predictor interactions are expensive operations to model. MRRL exploits temporal locality by modeling only those interactions that occur nearest to the clusters and are therefore most likely to be relevant to the simulation of the clusters. Modeling fewer interactions gives MRRL its speed advantage.

Below, the SPEEDUP table gives the simulation running time of fullwarmup (in seconds) and MRRL as a percentage thereof. The ACCURACY table gives the end-to-end (i.e., true) IPC, the IPC achieved by fullwarmup on samples of 50 1-million-instruction clusters, and the MRRL IPC on the same samples as well as its percent-error deviation from the fullwarmup IPC. The true IPCs come from the SimPoint Web site; therefore all fullwarmup and MRRL simulations use their sim-outorder configuration.

SPEEDUP
           fullwarmup (sec)   MRRL (% of fullwarmup time)
applu 78105 60.87%
apsi 120925 59.56%
crafty 78906 48.20%
equake 54675 57.11%
facerec 70587 51.98%
fma3d 96462 61.38%
galgel 157325 53.10%
gap 82896 53.10%
lucas 46730 50.67%
mcf 36014 46.45%
mgrid 142334 60.50%
swim 80946 54.81%
twolf 133069 44.04%
ACCURACY
           true IPC   fullwarmup IPC   MRRL IPC (% error vs. fullwarmup)
applu 0.831 0.772 0.772 (0.00%)
apsi 1.008 1.039 1.039 (-0.01%)
crafty 0.569 0.548 0.595 (-0.02%)
equake 0.310 0.311 0.311 (0.00%)
facerec 1.446 1.376 1.378 (0.18%)
fma3d 0.535 0.533 0.554 (3.90%)
galgel 1.334 1.245 1.245 (-0.02%)
gap 0.750 0.678 0.678 (0.01%)
lucas 0.774 0.791 0.791 (-0.04%)
mcf 0.092 0.095 0.095 (0.00%)
mgrid 0.987 1.034 1.034 (-0.01%)
swim 0.694 0.716 0.716 (0.00%)
twolf 0.636 0.629 0.630 (0.13%)

As the ACCURACY table shows, MRRL deviates very little from the fullwarmup IPC, straying by more than 0.2% only for fma3d. And even though the fma3d percent error is 3.9%, the absolute difference between the two IPCs is only 0.021 instructions per cycle---a tiny amount. The strength of random cluster sampling is demonstrated by the small deviations from the true IPC obtained by simulating a mere 50 million instructions in cycle-accurate detail. Only three benchmarks (applu, galgel, and gap) deviate by more than 5%, and even these deviations could have been reduced further by selecting more than 50 clusters for those benchmarks, since larger samples more fully represent the population of clusters within the complete dynamic instruction stream.

Equally important, the SPEEDUP table shows that MRRL reduces the simulation running time (relative to fullwarmup) by more than 40% on most benchmarks.

What is ss3b_mrrl-0.0.1.tar.gz and how do I use it?

ss3b_mrrl-0.0.1.tar.gz is a tar'ed, gzip'ed archive containing the source files for the MRRL tools. To unpack the archive, enter the SimpleScalar directory, then gunzip and un-tar it as in the example below:
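The following is a hedged sketch of the unpacking step, assuming the archive has been copied into the top-level SimpleScalar source directory (the directory name shown is illustrative):

    cd simplesim-3.0                    # your SimpleScalar source directory
    gunzip ss3b_mrrl-0.0.1.tar.gz       # decompress the archive
    tar -xvf ss3b_mrrl-0.0.1.tar        # extract the MRRL source files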

Once this step is complete, all that remains is to set up the MRRL toolset distribution. Execute the included setup script, which renames several of the out-of-the-box SimpleScalar files and sets up symbolic links to their MRRL-enabled equivalents. To reinstate the SimpleScalar files as they appeared before, execute the corresponding teardown step, which redirects the symbolic links back to the original files. A sketch of both steps appears below; this process, and information about other topics, can be found in the README.mrrl file.
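The script names below are placeholders (the actual names are given in README.mrrl); the sketch simply illustrates the install/uninstall pattern described above:

    # install: rename the stock files and symlink the MRRL-enabled versions
    sh ./mrrl-setup.sh

    # uninstall: point the symbolic links back at the original SimpleScalar files
    sh ./mrrl-unsetup.sh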

With what SimpleScalar version(s) will ss3b_mrrl-0.0.1.tar.gz work?

ss3b_mrrl-0.0.1.tar.gz was created against and therefore works quite well with SimpleScalar version 3.0b. I believe it also works for 3.0c, and suspect that it will work for 3.0a, but I have not confirmed either.

I have successfully deployed the archive. Now what?

Once the MRRL archive has been deployed into the SimpleScalar directory, you will have to compile both sim-mrrlprofile and the multiple-cluster, MRRL-enabled version of sim-outorder. Do this as you normally would (a hedged sketch follows):
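A hedged sketch of the build, assuming the MRRL setup step has linked in a Makefile that provides a sim-mrrlprofile target; the usual SimpleScalar configuration targets still apply:

    make config-alpha                   # or: make config-pisa, as appropriate
    make sim-outorder sim-mrrlprofile   # build the two MRRL-enabled simulators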

The first step after compiling is to acquire profile data from the benchmark for each sample partition of the dynamic instruction stream. Each pre-cluster--cluster partition contains the actual sample cluster instructions and the instructions that immediately precede them. Consider the example command line described in README.mrrl (a hedged reconstruction appears below). The first parameter, -fastfwd:inst, gives the profiler the number of instructions that precede each simulation cluster, relative to the start of execution; the second parameter, -cluster:inst, gives the number of instructions that compose the actual cluster within each pre-cluster--cluster partition. In the example, we instruct the profiler that we want 2 pre-cluster--cluster partitions of the twolf benchmark. The first partition consists of instructions 1 through 100,000,000 and contains 90 million pre-cluster and 10 million cluster instructions. Similarly, the second partition consists of instructions 100,000,001 through 150,000,000 and contains only 40 million pre-cluster instructions and (again) 10 million cluster instructions. Furthermore, since the second and final cluster ends 150 million instructions after the start of execution, the third parameter, -max:inst, is used to end the profiling run early. The remaining parameters are standard SimpleScalar, such as one would enter to run sim-safe.

After the profile data are collected (here, in a file named sim-mrrlprofile.twolf), they are used to drive the multiple-cluster, MRRL-enabled version of sim-outorder in two steps, also sketched below. In the first step, a simple Perl script named mrrl-mkconfig.pl extracts data from the original sim-mrrlprofile output file, sim-mrrlprofile.twolf, puts them into a format that is readily consumed by sim-outorder, and places them into a new file named mrrl-twolf.config.
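The commands below are a hedged reconstruction of the workflow just described, not the literal text of README.mrrl: the benchmark binary name, the use of -redir:sim for output redirection, the argument conventions of mrrl-mkconfig.pl, and the whitespace syntax for the list-valued -fastfwd:inst and -cluster:inst options are all assumptions, so consult README.mrrl for the exact invocations.

    # Profiling: 2 partitions of twolf; the clusters are instructions
    # 90,000,001-100,000,000 and 140,000,001-150,000,000, so profiling
    # can stop at 150 million instructions.
    sim-mrrlprofile -fastfwd:inst 90000000 140000000 \
                    -cluster:inst 10000000 10000000 \
                    -max:inst 150000000 \
                    -redir:sim sim-mrrlprofile.twolf \
                    twolf.bin <standard twolf arguments>

    # Step 1: convert the raw profile into a sim-outorder-readable config file
    perl mrrl-mkconfig.pl sim-mrrlprofile.twolf > mrrl-twolf.config

    # Step 2: multiple-cluster, MRRL-enabled cycle-accurate simulation
    sim-outorder -mrrl 0.999999 -mrrl:config mrrl-twolf.config \
                 -fastfwd:inst 90000000 140000000 \
                 -cluster:inst 10000000 10000000 \
                 -redir:sim sim-outorder.twolf-mrrl+0.999999 \
                 twolf.bin <standard twolf arguments>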

The second step executes the multiple-cluster, MRRL-enabled version of sim-outorder. The first parameter, -mrrl, instructs the simulator to set up for MRRL profile-driven execution; its value of 0.999999 indicates that we want warm up to commence according to the 99.9999th percentile of MRRL measurements. The second parameter, -mrrl:config, points the simulator to the processed MRRL profile report, mrrl-twolf.config. The third and fourth parameters (-fastfwd:inst and -cluster:inst) are used exactly as with sim-mrrlprofile, and indicate the pre-cluster--cluster partitioning of the benchmark execution. Finally, the last parameters are standard SimpleScalar. When complete, sim-outorder.twolf-mrrl+0.999999 will contain a standard sim-outorder report, augmented with per-cluster IPCs (which are very useful for statistical analysis).

What are known glitches with MRRL?

Both sim-mrrlprofile and the MRRL-enabled sim-outorder can only accommodate samples containing 8192 or fewer clusters. The maximum number of clusters may be increased by editing the line that defines this limit in the sim-mrrlprofile.c and sim-outorder_mrrl.c files and recompiling; a hedged sketch of what that line looks like follows.
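The constant name below is a guess rather than a quotation from the sources; check the two files for the actual spelling of the cluster limit:

    /* hypothetical name -- look in sim-mrrlprofile.c and sim-outorder_mrrl.c
       for the definition that sets the 8192-cluster limit */
    #define MAX_CLUSTERS 8192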

Also, sim-mrrlprofile is very RAM-hungry. This is because MRRL must track reuse latency for each individual memory address, M[A], separately for instruction references, data references, and instruction references that are branches; naturally, this requires a very large data structure. Ideally, an associative array would be maintained for each of the three reference streams. Rather than suffer the O(n) worst-case overhead of associative lookup, however, sim-mrrlprofile keeps three deeply associative (32-way) arrays containing 128K sets apiece. Accesses to these arrays work almost exactly like a set-associative cache lookup: an O(1) hash is followed by an O(1) search. [NOTE: The search is not linear (i.e., O(n)) because the number of ways is fixed rather than a function of the number of references stored.] This bounds the critical path of the lookup mechanism to 32 compares---one for each way---so the worst-case lookup time of sim-mrrlprofile is constant. This organization occasionally causes elements to be clobbered when a set becomes full, but that happens rarely because the arrays are so large.
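The declarations below are a minimal sketch of such a structure, not the actual code from sim-mrrlprofile.c; the names, field layout, and hash are assumptions chosen to match the sizes described here (128K sets x 32 ways = 4096K entries of 16 bytes, i.e., 64MB per array):

    #define MRRL_SETS (128 * 1024)        /* 128K sets                          */
    #define MRRL_WAYS 32                  /* 32-way deep associativity          */

    struct mrrl_entry {                   /* 16 bytes on a typical 32-bit host  */
      unsigned int addr;                  /* memory address M[A] being tracked  */
      unsigned int valid;                 /* is this way occupied?              */
      unsigned long long last_use;        /* instruction count at last reference*/
    };

    /* one such array per reference stream: instructions, data, branches */
    static struct mrrl_entry reuse[MRRL_SETS][MRRL_WAYS];

    /* O(1) hash into a set, then at most MRRL_WAYS compares */
    static struct mrrl_entry *mrrl_lookup(struct mrrl_entry (*tbl)[MRRL_WAYS],
                                          unsigned int addr)
    {
      struct mrrl_entry *set = tbl[addr % MRRL_SETS];
      int w;
      for (w = 0; w < MRRL_WAYS; w++)
        if (set[w].valid && set[w].addr == addr)
          return &set[w];
      return NULL;    /* not found: the caller claims a way, possibly
                         clobbering an older entry if the set is full */
    }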

Obviously, there is a trade-off between time and space in any computing application, as any treatment of complexity theory will attest; we chose speed (of both execution and implementation). Even so, the amount of space demanded by sim-mrrlprofile is not unreasonable for most modern machines. Each deeply associative array contains 4096K entries of 16 bytes apiece, for a total of 64MB per array. After all three arrays have been allocated, that makes a total of 192MB of storage, which fits easily into a machine with 512MB of physical RAM.

Anything else worth noting?

Actually, yes. In light of the emergence of 3-level cache hierarchies (e.g., POWER4, Itanium, PA8700), sim-outorder has also been modified to accommodate a third level of cache. This third level is configured in the same way as the first two, and can be either unified or split in two for separate storage of instructions and data.
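The option names below are assumptions by analogy with the stock -cache:dl2/-cache:il2 flags; I have not verified them against the MRRL sources, so check README.mrrl for the real spellings. The configuration strings use the standard SimpleScalar <name>:<nsets>:<bsize>:<assoc>:<repl> format:

    # unified 4MB, 8-way L3 shared by instructions and data (hypothetical flags)
    sim-outorder -cache:dl3 ul3:8192:64:8:l -cache:il3 dl3 ...

    # split L3 with separate instruction and data storage (hypothetical flags)
    sim-outorder -cache:dl3 dl3:4096:64:8:l -cache:il3 il3:4096:64:8:l ...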

Also, to facilitate accurate measurement of cache and branch predictor behavior, we have implemented a mechanism for fine-grained control over statistics gathering in the cache hierarchy and branch predictor. Consider how sim-outorder_mrrl.c uses this mechanism for the branch predictor:
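The snippet below is a hedged reconstruction rather than a verbatim quotation from sim-outorder_mrrl.c; gather_stats and STAT_BR_CLEAR are discussed in the text, but the individual flag names are assumptions:

    /* pre-cluster (warm up) instructions: disable all branch predictor
       statistics gathering (STAT_BR_CLEAR == 0, so no flags are set)   */
    pred->gather_stats = STAT_BR_CLEAR;

    /* cycle-accurate cluster begins: re-enable gathering by asserting the
       flag of every statistic tracked by bpred_lookup()/bpred_update()
       (the flag names here are illustrative)                            */
    pred->gather_stats = STAT_BR_LOOKUPS | STAT_BR_ADDR_HITS
                       | STAT_BR_DIR_HITS | STAT_BR_RETSTACK_POPS;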

The gather_stats field is a new field that we added to the struct bpred_t data type; it contains a set of flags that correspond to each statistic tracked by the bpred_lookup() and bpred_update() routines (e.g., lookups, retstack_pops, addr_hits). Asserting a flag within the gather_stats field enables tracking of the corresponding statistic; resetting a flag disables it. (Thus, since the constant STAT_BR_CLEAR equals 0, the first assignment in the snippet above disables all statistics tracking.) In sim-outorder_mrrl.c we disable all cache and branch predictor statistics tracking as the pre-cluster instructions begin simulating; as the cycle-accurate cluster simulation begins, we re-enable it by asserting the flags of every statistic tracked by the bpred_lookup() and bpred_update() routines, as in the second assignment above. Our additions and enhancements can be found in the cache_mrrl.[ch] and bpred_mrrl.[ch] source files and should prove useful for other experimentation with sampled simulation, or for any other scenario where the researcher needs fine-grained control over statistics gathering.

We also enhanced timing accuracy by placing a call to the UNIX getrusage() system call in main_mrrl.c. This allows accurate timing data to be gathered even when the host system is heavily loaded running binaries other than the simulator. The time computed in this way is reported as sim_cpu_time in the sim-outorder report.
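A minimal sketch of how CPU time can be obtained with getrusage(); this illustrates the mechanism, not the literal code in main_mrrl.c:

    #include <sys/time.h>
    #include <sys/resource.h>

    /* CPU time (user + system) consumed by the simulator process itself, in
       seconds.  Because getrusage() charges only this process, the result is
       insensitive to other load on the host machine. */
    double sim_cpu_seconds(void)
    {
      struct rusage ru;
      getrusage(RUSAGE_SELF, &ru);
      return (ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6)
           + (ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6);
    }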

Finally, we hacked options_mrrl.[ch] to include support for counter_t command-line option classes, both single values and lists. This became necessary in light of the extremely long-running SPEC CPU2000 benchmarks: parser, for instance, runs for more than 0.5 trillion instructions on the reference inputs, which is far more than can fit into an unsigned 32-bit integer!
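To make the range problem concrete, here is a self-contained illustration; the counter_t typedef below is a stand-in for the 64-bit type that SimpleScalar's host.h provides, not the distribution's own code:

    #include <stdio.h>

    /* stand-in for SimpleScalar's counter_t (a 64-bit integer in host.h) */
    typedef unsigned long long counter_t;

    int main(void)
    {
      unsigned int small = 0xffffffffu;        /* 32-bit maximum: 4,294,967,295 */
      counter_t    insts = 500000000000ULL;    /* ~0.5 trillion (parser/ref)    */

      printf("32-bit maximum:          %u\n", small);
      printf("parser instruction count: %llu\n", insts);
      return 0;
    }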

How do I contact the authors?

I (John) currently maintain the MRRL tools, but this will become increasingly difficult since I have left graduate school and entered full-time employment. Until my advisor (Kevin) and I arrange for a new student to maintain the code, I can be reached at predator@cs.virginia.edu. I cannot guarantee a timely response, but I will do my best to answer all requests, questions, and bug reports.

Latest version: 31 January 2003


MRRL enhancements and utilities. Copyright (C) 2002,2003 John Haskins, Jr. and Kevin Skadron.