A few thoughts of Memory Benchmarking....

Rick Hetherington (rick.hetherington@Eng.Sun.COM)
Fri, 27 Sep 1996 15:35:10 -0700

I second John Henning's comments on simplicity. I do believe that the sheer
simplicity of this benchmark was one key element that enabled wide
acceptance in the engineering community. We should make every effort to
maintain that approach.

Another attractive feature of STREAM is that it is virtually snooker
proof. It tends to annoy the snooker patrols and that is just fine with me.



>(1) Should STREAM be extended to automagically measure bandwidths at each
> level of a memory hierarchy? What is a robust way of doing this with
> a single, portable piece of source code?

Yes, indeed, this should be a goal, but I don't yet know a robust way of
doing it.
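
For illustration only, here is a minimal C sketch of the naive approach: run a STREAM-style triad over a range of working-set sizes and look for the steps in bandwidth. This is not the robust method the question asks for, and the names triad_bandwidth and sweep are made up for this sketch. It assumes a POSIX clock_gettime and counts 3 doubles per element, as STREAM does for triad.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* STREAM-style triad kernel: a[i] = b[i] + q * c[i]. */
static void triad(double *a, const double *b, const double *c,
                  double q, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}

/* Time `reps` triad passes over three arrays of n doubles and return the
   apparent bandwidth in bytes per second, or -1.0 if allocation fails or
   the run was too short for the clock to resolve. (Hypothetical helper.) */
static double triad_bandwidth(size_t n, int reps)
{
    assert(n > 0 && reps > 0);
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    double result = -1.0;

    if (a && b && c) {
        for (size_t i = 0; i < n; i++) {   /* touch every page before timing */
            a[i] = 0.0;
            b[i] = 1.0;
            c[i] = 2.0;
        }
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < reps; r++)
            triad(a, b, c, (double)r, n);  /* vary q so the passes differ */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        volatile double sink = a[n / 2];   /* keep the stores observable */
        (void)sink;

        double secs = (double)(t1.tv_sec - t0.tv_sec)
                    + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
        /* Triad reads b and c and writes a: counted as 3 doubles/element. */
        if (secs > 0.0)
            result = 3.0 * sizeof(double) * (double)n * reps / secs;
    }
    free(a);
    free(b);
    free(c);
    return result;
}

/* Fill out[] with the bandwidth at sizes min_n, 2*min_n, ... up to max_n.
   Returns the number of entries written. (Hypothetical helper.) */
static size_t sweep(size_t min_n, size_t max_n, int reps,
                    double *out, size_t out_len)
{
    assert(min_n > 0);
    size_t k = 0;
    for (size_t n = min_n; n <= max_n && k < out_len; n *= 2)
        out[k++] = triad_bandwidth(n, reps);
    return k;
}
```

Calling sweep(1024, 1u << 24, reps, out, len) and printing out[] against the array size should show plateaus roughly where the three arrays stop fitting in each cache level. Finding those steps automatically, and choosing reps so that every size runs long enough to time, is exactly the part that is hard to make robust and portable.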

>(2) Should STREAM be extended to automagically measure latencies at each
> level of a memory hierarchy? How would this be different than what
> Larry McVoy does with lmbench/lat_mem_rd? Is it possible or desirable
> to measure a "different kind" of latency than lmbench measures?

Again, this is a definite goal. Lmbench/lat_mem_rd is just fine, and I
would like to see it extended to measure average latency in an MP environment.
Depending on the system topology, run multiple copies of lat_mem_rd, measure
the average latency per processor, and report that for 2-way, 4-way, etc. Just
run the same dependent-load loop on each processor, wait until all processors
are in steady state, take a timing measurement on each, and report the mean.
One more key item here is cache-to-cache latency. Again depending on the
system topology, we should measure local cache-to-cache latency,
global-cache-to-local-cache latency, and perhaps global-to-global cache latency.
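
To make the dependent-load loop above concrete, here is a minimal single-processor sketch in C. It is in the spirit of lat_mem_rd but is not lmbench's code: all of the function names are invented for this example, and it assumes a POSIX clock_gettime. The MP version would run chase() concurrently on every processor once all are in steady state, which is omitted here. Only the final averaging step is shown.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Link the n slots of `next` into one random cycle (Sattolo's algorithm),
   so each load address depends on the previous load's result.
   Note: rand() has limited range on some platforms, so very large n
   would want a better generator. */
static void build_cycle(size_t *next, size_t n, unsigned seed)
{
    assert(n > 1);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j < i keeps it a single cycle */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for `steps` dependent loads and return the final index,
   so the compiler cannot discard the loop. */
static size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t p = start;
    for (size_t s = 0; s < steps; s++)
        p = next[p];
    return p;
}

/* Average nanoseconds per dependent load over `steps` loads. */
static double ns_per_load(const size_t *next, size_t steps, size_t *sink)
{
    assert(steps > 0);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    *sink = chase(next, 0, steps);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}

/* Mean of the per-processor latencies: the single number to report
   for a 2-way, 4-way, ... run. */
static double mean_latency(const double *per_cpu_ns, size_t ncpu)
{
    assert(ncpu > 0);
    double sum = 0.0;
    for (size_t i = 0; i < ncpu; i++)
        sum += per_cpu_ns[i];
    return sum / (double)ncpu;
}
```

The chain has to be much larger than the cache level being measured, and one untimed pass over it first gets the loop into steady state. This sketch strides by one size_t, so several slots share a cache line. A more careful version would pad each slot to a full line.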

>(3) Should STREAM be extended to automagically measure bandwidths and
> latencies across distributed memory systems, such as the Convex cc-NUMA
> machines, and the future SGI "Scalable Node" products? What about
> distributed memory machines without global addressability? (I am not
> interested in an MPI "solution"!!!!!)

Absolutely. In fact, the current STREAM has a bit of a hole in that NUMA
machines can define their data sets to live locally and never contend for a
common resource... not exactly the typical behaviour found in real
applications. The system benchmark should attempt to measure "scalable node"
products that are globally addressable, but steer clear of clustered systems
that use widgets like Memory Channel or Ethernet for message passing. A
specialized benchmark could perhaps be developed for those, but it is
pointless to run the same benchmark on a number of clustered machines when it
causes no traffic between the clustered nodes.

>(4) If we are looking at multiprocessor machines with some kind of network
> rather than a shared bus, do we want to look at "peak point-to-point"
> performance, or performance in the presence of contention? How do we
> want to define the communication patterns that will create that contention?
> Can this be done in a way that is not biased for or against any particular
> network architecture? (Unless that bias seems "reasonable"?)

If all processing nodes can address all of memory wherever it may exist,
contention for common resources will naturally occur. I don't think we need to
precisely script a collision of hand-selected ops to create contention; MP
system validation engineers will tell you how difficult that is.
Determine the latency bump from local to global accesses, generate that
traffic on as many nodes as possible, and measure average latency and
bandwidth as above with lat_mem_rd and a cache-to-cache test.

>(5) In a completely different direction -- how do we address the question of
> how much a memory system characterization can be useful for application
> performance prediction? How much detail is needed to predict the
> performance of SPECfp95 to within acceptable margins of error, for example?
> I have done a bit of this using ad hoc curve fitting of available numbers,
> but a "first principles" approach would be preferred.

The numbers seem to be all over the map right now, but once we measure
every possible relevant memory parameter, I'm sure an appropriate spreadsheet
could be developed that would accurately predict overall performance.
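
As a purely illustrative guess at what one row of such a spreadsheet might compute, here is a small C sketch of an additive model: run time is compute time, plus streamed bytes divided by measured bandwidth, plus dependent misses times measured latency. The struct fields and the formula are assumptions made for this example. A real model would need more parameters and would have to account for overlap between computation and memory traffic, which this one ignores.

```c
#include <assert.h>

/* Hypothetical measured machine parameters. */
struct machine {
    double bandwidth_bytes_per_s;  /* sustained bandwidth, e.g. from STREAM */
    double latency_s;              /* dependent-load latency, e.g. lat_mem_rd */
};

/* Hypothetical per-application characterization. */
struct workload {
    double cpu_s;             /* time with a perfect memory system */
    double streamed_bytes;    /* bytes moved in bandwidth-bound phases */
    double dependent_misses;  /* misses whose latency cannot be overlapped */
};

/* Additive estimate of run time in seconds; assumes no overlap between
   computation, streaming, and latency-bound misses. */
static double predict_seconds(const struct machine *m,
                              const struct workload *w)
{
    assert(m->bandwidth_bytes_per_s > 0.0);
    return w->cpu_s
         + w->streamed_bytes / m->bandwidth_bytes_per_s
         + w->dependent_misses * m->latency_s;
}
```

Comparing predictions like this against measured run times on several machines would show how many parameters the model actually needs.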