John's issues

Preston Briggs (preston@tera.com)
Wed, 25 Sep 1996 21:42:45 -0700

Here are some thoughts on John's initial set of questions. I'm afraid
my comments are pretty Tera-specific since that's what I care about
these days, but hopefully others will think and write about other
systems.

(1) Should STREAM be extended to automagically measure bandwidths at each
level of a memory hierarchy? What is a robust way of doing this with
a single, portable piece of source code?

Obviously this won't work on the Tera, which has only one level of
memory. On the other hand, we should make sure it doesn't give some
sort of spurious answer.

(2) Should STREAM be extended to automagically measure latencies at each
level of a memory hierarchy? How would this be different than what
Larry McVoy does with lmbench/lat_mem_rd? Is it possible or desirable
to measure a "different kind" of latency than lmbench measures?

I guess I need to study McVoy's stuff. To measure latency on a Tera,
I think we'd need to time a simple loop, compiled with flags to
prevent parallelization and software pipelining. Easy enough for me
to do, but I'm not sure how easy it'd be to write a general-purpose
piece of code that would do this for all sorts of machines.
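
A minimal sketch of what such a timing loop might look like in portable
C. It is an assumption, not Tera-specific code and not taken from
lmbench: each load depends on the result of the previous one, so the
loads cannot be overlapped even without special compiler flags. The
names build_cycle, chase, and xorshift64 are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Small xorshift generator, used so the sketch does not depend on RAND_MAX. */
static unsigned long long rng_state = 88172645463325252ULL;

static unsigned long long xorshift64(void)
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

/* Fill next[] with one random cycle over all n slots (Sattolo's algorithm),
   so following next[] from slot 0 visits every slot before returning to 0. */
static void build_cycle(size_t *next, size_t n)
{
    size_t i, j, tmp;

    assert(n > 0);
    for (i = 0; i < n; i++)
        next[i] = i;
    for (i = n - 1; i > 0; i--) {
        j = (size_t)(xorshift64() % i);     /* 0 <= j < i */
        tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Perform `steps` dependent loads and return CPU seconds per load.
   The final index is stored through `sink` so the loop is not optimized away. */
static double chase(const size_t *next, size_t steps, size_t *sink)
{
    size_t p = 0, i;
    clock_t t0, t1;

    t0 = clock();
    for (i = 0; i < steps; i++)
        p = next[p];                        /* each load waits for the last */
    t1 = clock();
    *sink = p;
    return (double)(t1 - t0) / CLOCKS_PER_SEC / (double)steps;
}
```

To use it, allocate next[] much larger than any cache, call build_cycle
once, then call chase with enough steps that the elapsed time is well
above the resolution of clock().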

(3) Should STREAM be extended to automagically measure bandwidths and
latencies across distributed memory systems, such as the Convex cc-NUMA
machines, and the future SGI "Scalable Node" products?

Hmmm, I was going to say this would be hard, but actually it's easy on
our machine. It'll just be twice the latency measured above. Again,
I haven't a clue how to do it in the general case.

(4) If we are looking at multiprocessor machines with some kind of network
rather than a shared bus, do we want to look at "peak point-to-point"
performance, or performance in the presence of contention?

I'd think this would be useful. Not sure how to arrange it. I guess
one approach would be to saturate each processor with tight loops that
access random memory locations at every instruction.

Bill Broadley <bill@math.ucdavis.edu> wrote:
>I think if your looking for peak observable practical memory bandwidth
>to doing something more interesting then memcpy/bzero then stream does
>a good job, for a memstone I'd say (assign+Scale+sum+saxpy)/4 would

I disagree. All four of these (along with memcpy and bzero) will
give the same answer on our machine and, I expect, most
cache-dependent machines. We need to do something more interesting
than simple stride-1 vector operations.

Which is my biggest complaint about STREAM. Throw in some sort of
cache buster, e.g., a histogram

    for (i = 0; i < n; i++)
        bucket[key[i]]++;

where the size of the bucket array is fairly hefty (much bigger than
cache) and the values in the key vector are uniformly random.
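
A minimal sketch of how this loop might be set up and timed. The helper
names (fill_keys, histogram, xorshift64) and the choice of clock() for
timing are assumptions made for illustration; they are not part of
STREAM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Small xorshift generator, used so the sketch does not depend on RAND_MAX. */
static unsigned long long rng_state = 88172645463325252ULL;

static unsigned long long xorshift64(void)
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

/* Fill key[0..n-1] with approximately uniform values in [0, nbuckets). */
static void fill_keys(size_t *key, size_t n, size_t nbuckets)
{
    size_t i;

    assert(nbuckets > 0);
    for (i = 0; i < n; i++)
        key[i] = (size_t)(xorshift64() % nbuckets);
}

/* The histogram loop from the message; returns elapsed CPU seconds.
   bucket[] must hold at least nbuckets zero-initialized counters. */
static double histogram(unsigned *bucket, const size_t *key, size_t n)
{
    size_t i;
    clock_t t0, t1;

    t0 = clock();
    for (i = 0; i < n; i++)
        bucket[key[i]]++;                   /* random read-modify-write */
    t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

For a real run, nbuckets would be chosen so that bucket[] is much
bigger than cache, and n large enough that the time is well above the
resolution of clock(). Dividing n by the returned time gives updates
per second.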

Regarding C and Fortran... I don't care; won't make any difference to
me. Probably Broadley's argument that there's no free Fortran
compiler is pretty telling, though a fair amount of care will be
required to write C that can be generally compiled as well as Fortran.

Preston