MRRL: Memory Reference Reuse Latency
====================================
Copyright (C) 2002, 2003 John W. Haskins, Jr. and Kevin Skadron


* INTRODUCTION
--------------
Memory reference reuse latency (MRRL), as the name implies, is the
amount of time (for our purposes, measured in number of completed
instructions) elapsed between successive accesses to a unique
location in memory, M[A].  My research uses MRRL measurements of
addresses in the instruction stream, data stream, and branch stream
(a proper subset of the instruction stream) to estimate how many of
the instructions preceding a cycle-accurate SimpleScalar simulation
sample cluster need to be warmed up in order to closely approximate
warming up all pre-cluster instructions.  (Please visit
ftp://ftp.cs.virginia.edu/pub/techreports/CS-2002-19.ps.Z for a copy
of the MRRL technical report.)

We have made extensive modifications to SimpleScalar v3.0b that allow
sampled benchmark execution: swapping between cold (low-detail), warm
(low-detail plus cache & branch predictor modelling), and hot (full,
cycle-accurate detail) simulation according to MRRL profile data.


* INSTALLATION
--------------
The 'ss3b_mrrl-x.y.z.tar.gz' archive comes with a simple shell script,
'mrrl_go.sh', which can be used to install and undo the source code
changes that enable the multiple-cluster MRRL simulations.  To deploy
the MRRL enhancements, run

        mrrl_go.sh deploy

This will rename the originals of the affected files thus:

        main.c  -> main_ss3b.c,
        bpred.c -> bpred_ss3b.c,
        bpred.h -> bpred_ss3b.h,
        ...

and create symbolic links to the MRRL versions of the files thus:

        main.c  -> main_mrrl.c,
        bpred.c -> bpred_mrrl.c,
        bpred.h -> bpred_mrrl.h,
        ...
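
The rename-then-link step described above can be illustrated with a
short Python sketch.  This is only an illustration of the described
behaviour, not the contents of 'mrrl_go.sh'; the helper name 'deploy'
and the placeholder files are assumptions made for the example.

```python
import os
import tempfile

def deploy(tree, names):
    """Hypothetical helper: rename each original to *_ss3b and link the
    original name to the *_mrrl version, as the text above describes."""
    for name in names:
        stem, ext = os.path.splitext(name)
        original = os.path.join(tree, name)
        backup = os.path.join(tree, f"{stem}_ss3b{ext}")
        # Keep the stock file under a new name, e.g. main.c -> main_ss3b.c.
        os.rename(original, backup)
        # Point the original name at the MRRL version with a relative
        # symbolic link, e.g. main.c -> main_mrrl.c.
        os.symlink(f"{stem}_mrrl{ext}", original)

if __name__ == "__main__":
    # Demonstrate on a throwaway directory holding placeholder files.
    with tempfile.TemporaryDirectory() as tree:
        for name in ("main.c", "main_mrrl.c"):
            with open(os.path.join(tree, name), "w") as handle:
                handle.write("/* placeholder */\n")
        deploy(tree, ["main.c"])
        print(sorted(os.listdir(tree)))
        print("main.c ->", os.readlink(os.path.join(tree, "main.c")))
```

Because the stock file survives under its '_ss3b' name, undoing the
change only requires pointing the link somewhere else.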

The originals are preserved in this way so that users may roll back to
"out-of-the-box" SimpleScalar v3.0b quickly and easily by running

        mrrl_go.sh rollback

This will, of course, replace the symbolic links to the MRRL versions
of the source files with links to the originals thus:

        main.c  -> main_ss3b.c,
        bpred.c -> bpred_ss3b.c,
        bpred.h -> bpred_ss3b.h,
        ...

Crude failsafes have been included in the 'mrrl_go.sh' script to guard
against some incidents of clumsiness.  These failsafes are centered
around special read-only files that 'mrrl_go.sh' relies on:
'nevel_mrrl_deployed', 'not_mrrl_deployed', and 'mrrl_deployed'.  The
first comes with the 'ss3b_mrrl-x.y.z.tar.gz' archive and prevents
running

        mrrl_go.sh rollback

on a SimpleScalar tree that has never received the MRRL enhancements.
'not_mrrl_deployed' is created after the deployment is undone, and
'mrrl_deployed' is created after the enhancements are successfully
deployed and prevents them from being deployed again.


* USAGE
-------
Before MRRL-style sampled simulation can proceed, profile data must be
acquired from the benchmark for each pre-cluster--cluster pair.  A
pre-cluster--cluster pair is a slice of a benchmark's execution, which
is itself split into the instructions that precede the simulation
cluster and the cluster proper.  The MRRL profiler, like the
MRRL-enabled version of 'sim-outorder', is designed to accommodate
multiple pre-cluster--cluster slices.  Consider the following example
command line:

        sim-mrrlprofile -fastfwd:inst 90000000 140000000 \
                        -cluster:inst 10000000 \
                        -max:inst 150000000 \
                        -redir:prog output.twolf \
                        -redir:sim sim-mrrlprofile.twolf \
                        bin.alpha/twolf.peak ./twolf_ref

The first parameter, "-fastfwd:inst", gives the profiler the number of
instructions that precede each simulation sample cluster, counted from
the start of execution; the second parameter, "-cluster:inst", gives
the number of instructions that compose the cluster within each
pre-cluster--cluster partition.
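
The partitioning implied by these two parameters is simple arithmetic.
The following is a minimal sketch, assuming (as stated above) that the
"-fastfwd:inst" values are counted from the start of execution and
that every cluster has the same length.  The function 'partition' is a
hypothetical helper written for this illustration; it is not part of
SimpleScalar or of the MRRL tools.

```python
def partition(fastfwd, cluster):
    """Return one (first_inst, last_inst, pre_cluster, cluster) tuple per
    pre-cluster--cluster slice, with 1-based inclusive instruction numbers."""
    slices = []
    prev_end = 0  # last instruction of the previous slice
    for ff in fastfwd:
        pre_cluster = ff - prev_end
        if pre_cluster < 0:
            raise ValueError("fast-forward points overlap an earlier cluster")
        end = ff + cluster  # the cluster follows the fast-forward point
        slices.append((prev_end + 1, end, pre_cluster, cluster))
        prev_end = end
    return slices

# Values taken from the example command line above.
for s in partition([90_000_000, 140_000_000], 10_000_000):
    print(s)
# (1, 100000000, 90000000, 10000000)
# (100000001, 150000000, 40000000, 10000000)
```

The end of the last slice (150,000,000 here) is the natural value for
"-max:inst".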

Hence, above we are telling the MRRL profiler that we want 2
pre-cluster--cluster partitions of the 'twolf' benchmark.  The first
slice consists of instructions 1 through 100,000,000, and contains
90,000,000 pre-cluster and 10,000,000 cluster instructions.
Similarly, the second slice consists of instructions 100,000,001
through 150,000,000, and contains only 40,000,000 pre-cluster and
(again) 10,000,000 cluster instructions.  Since the second and final
cluster concludes 150,000,000 instructions after the start of
execution, the third parameter, "-max:inst", is used to end the
profiling run early.  (When executed with the multiple-cluster MRRL
version of 'sim-outorder', the instructions in each cluster will be
executed in full, cycle-accurate detail.)  The last parameters are
standard SimpleScalar, such as one would use to run 'sim-safe', e.g.,

        sim-safe -max:inst 150000000 \
                 -redir:prog output.twolf \
                 -redir:sim sim-safe.twolf \
                 bin.alpha/twolf.peak ./twolf_ref

Once the profile data have been collected (here, in a file named
'sim-mrrlprofile.twolf'), they can be used to drive the
multiple-cluster, MRRL-enabled version of 'sim-outorder', thus:

        mrrl-mkconfig.pl sim-mrrlprofile.twolf > mrrl-twolf.config

followed by

        sim-outorder -mrrl 0.999999 \
                     -mrrl:config mrrl-twolf.config \
                     -fastfwd:inst 90000000 140000000 \
                     -cluster:inst 10000000 \
                     -max:inst 150000000 \
                     -redir:prog output.twolf \
                     -redir:sim sim-outorder.twolf-mrrl+0.999999 \
                     bin.alpha/twolf.peak ./twolf_ref

In the first step, a simple Perl script named 'mrrl-mkconfig.pl'
extracts data from the original 'sim-mrrlprofile' output file,
'sim-mrrlprofile.twolf', converts them into a format that is readily
consumed by 'sim-outorder', and places them into a new file named
'mrrl-twolf.config'.  The second step executes the multiple-cluster
MRRL version of 'sim-outorder'.
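
Before turning to the individual parameters, it may help to see what a
percentile of reuse latencies means in practice.  The sketch below is
an illustration only: it assumes a toy trace of (instruction number,
address) pairs and a nearest-rank percentile, and the helper names are
hypothetical.  The bookkeeping actually performed by 'sim-mrrlprofile'
may differ; the technical report is the authoritative description.

```python
import math

def reuse_latencies(trace):
    """For each repeated access to an address, record the number of
    instructions elapsed since the previous access to that address."""
    last_seen = {}
    latencies = []
    for inst, addr in trace:
        if addr in last_seen:
            latencies.append(inst - last_seen[addr])
        last_seen[addr] = inst
    return latencies

def percentile(latencies, p):
    """Nearest-rank percentile: the smallest latency L such that at
    least a fraction p of the recorded latencies are <= L."""
    if not latencies:
        raise ValueError("no reuse latencies recorded")
    ordered = sorted(latencies)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]

# Toy trace: (instruction number, address).
trace = [(1, 0xA), (2, 0xB), (3, 0xA), (4, 0xC),
         (5, 0xB), (6, 0xA), (10, 0xC)]
lat = reuse_latencies(trace)
print(lat)                        # [2, 3, 3, 6]
print(percentile(lat, 0.5))       # 3
print(percentile(lat, 0.999999))  # 6
```

A higher percentile covers more of the long-latency reuses and
therefore implies a longer warm-up before each cluster.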

The first parameter, "-mrrl", instructs the simulator to set up for
MRRL profile-driven execution; 0.999999 indicates that we want the
99.9999th percentile of MRRL references.  The second parameter,
"-mrrl:config", points the simulator to the processed MRRL profile
report, 'mrrl-twolf.config'.  The third and fourth parameters,
"-fastfwd:inst" and "-cluster:inst", respectively, are identical to
those given to 'sim-mrrlprofile' and indicate the pre-cluster--cluster
partitioning of the benchmark execution.  Finally, the last parameters
are, once again, standard SimpleScalar.

NOTE: In light of the emergence of 3-level cache hierarchies (e.g.,
IBM's POWER4), I also modified 'sim-outorder' to accommodate a third
level of cache.  This third level is configured in the same way as the
first two, and can be either unified or split in two for separate
storage of instructions and data.


* CONCLUSION
------------
The process should be very straightforward.  First, the user must
determine the pre-cluster--cluster partitioning of the benchmark.
These data are then used to drive the MRRL profiling of that
partitioning.  Finally, with the profile completed, the user needs to
process the raw profile (using 'mrrl-mkconfig.pl') and execute sampled
simulations with the multiple-cluster, MRRL-enabled 'sim-outorder'.

Profiling and profile processing are a one-time cost for each
benchmark pre-cluster--cluster partitioning.  With these data, sampled
simulation can begin, and large state-space searches of the
microarchitecture design parameters can proceed rapidly and
accurately, with confidence that cold-start bias will not adversely
impact simulation output.


* SALUTATION
------------
Happy sampled, MRRL simulating!

- JHJr., predator@cs.virginia.edu