Standard Memory Bmark needed Now

John, dtn 381-0378 21-Sep-1996 0720 (henning@perfom.ENET.dec.com)
Sat, 21 Sep 96 08:17:21 EDT

First, introductions
John Henning
CSD Performance Group
Digital Equipment Corporation
henning@zko.dec.com

I've been at Digital forever (my wife moved only twice in her life,
never wants to relocate again :-) but for variety I have dabbled in
4GLs, compilers, databases, management, device drivers, graphics, human
factors, etc. But my first and most frequent topic of interest is
performance, and for the last 2.5 years I've been (back) in an official
performance group, where I've specialized in SPEC CPU95 (where *alpha*
has the record, sorry all you other guys) and preached the virtues of
memory bandwidth throughout the corporation. Rick Hetherington will
remember my lectures to various managers (as well as engineers) to try
to popularize this topic.

Well, John McCalpin's simple, easy-to-understand benchmark has done its
job, at least within Digital. Hardware engineers know what their
McCalpin goals are for new platforms (just like they know what their
SPEC goals and TPC goals are). So whatever we do with the benchmark, I
would urge that we not lose that simplicity. If we expand the benchmark
to measure bandwidth at many levels and latency at many levels, there
will have to be some sort of simple summary MemMark, or one metric will
have to remain Primary whilst others are only for those who desire more
details.

We should look into making the benchmark more official.
Today, there's a rule that allows one to derive the McCalpin rating
without doing the measurement: take whatever a vendor claims as the
bandwidth and multiply by 0.5. But as memory systems evolve, this rule
may not hold up. So a standardized benchmark is needed.

The benchmark code also needs to be put through a process of peer review
and critique. John McCalpin will recall that I analyzed 5 versions of
Stream about 14 months ago and provided commentary, which I think led
directly to John's retiring a couple of the versions. If others also
provide line-by-line critique, we'll end up with a better benchmark.

The benchmark code needs to have a discipline of version numbering and
dates of editing. There should be *one* line which, upon being
edited, causes a new version number to be made visible to both the
person reading the source code and the person running a pre-compiled
version of the benchmark.
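
A minimal sketch of that single-line discipline, in C. The macro name
BENCH_VERSION, the function names, and the version/date value are all
hypothetical placeholders, not anything taken from the Stream sources.
The one #define is the only line to edit; the same string is visible in
the source and is printed by the compiled binary.

```c
#include <stdio.h>

/* The ONE line to edit per release (placeholder value, not a real version). */
#define BENCH_VERSION "0.1 (21-Sep-1996)"

/* Built from the macro at compile time, so it is embedded in the binary. */
static const char bench_banner[] = "Benchmark version " BENCH_VERSION;

/* Returns the version banner for callers that want to log it. */
const char *bench_version(void)
{
    return bench_banner;
}

/* Prints the banner; returns the number of characters written. */
int print_banner(FILE *out)
{
    return fprintf(out, "%s\n", bench_banner);
}
```

Calling print_banner(stdout) at the top of every run would put the
version in every results file, so a reader of the output need not have
the source at hand.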

The benchmark needs to measure what it intends to measure, without
un-needed complexity. For a counter-example, see the Vorst/Dongarra
benchmark "benchm", which measures more than 20 interesting loops, but
only the VERY careful reader will see that the caches are left FAR TOO
WARM and therefore its results have a very different meaning than what
you think they do. Stream's code needs to be clear and simple and easy
to read.

It was nice that Stream had both a "C" version and a FORTRAN version
over the years. But if we are going to extend it, let's pick one and
stick with it. It will be too much of a headache if we add lots of
features to both versions, and try to ensure both are really measuring
the same thing.

I assert that the proper language to pick is FORTRAN, as that is the
only language used for Scientific programming. All those scientists
writing in "C" are really just writing the same FORTRAN programs they
used to write, with slightly modified syntax, and with compilers forced
to be afraid of their pointers.

Whatever we do should be *supported* by the consortia relations people
in our respective companies. SGI sometimes seems to have a strategy of
participating only on rare occasions when the benefit is remarkably
clear to the company. John McCalpin, do you have the support you would
need to push this forward?

More specific comments follow.

> (1) Should STREAM be extended to automagically measure bandwidths at each
> level of a memory hierarchy? What is a robust way of doing this with
> a single, portable piece of source code?

I think a portable program would find it hard to detect all the
transitions between levels; associative caches, streaming caches, etc.
may obscure the transition points. That smoothing is a "feature", a good
thing not a bad thing, but it would make it hard for a benchmark to say
the Mumble Processor has a 3-level memory hierarchy and here are the 3
corresponding ratings.

So perhaps instead you should pick Tiny, Small, Medium, and Large
datasets. Tiny would be picked at a size that is going to fit into
almost any imaginable on-chip cache. Large would be picked at a size
that is going to miss almost any on-chip cache: set it in the source
code to default to 1GB, and include a benchmarker's comment to the
effect that you are allowed to reduce it, but not to less than 4x the
size of the largest cache. Small and Medium would be within ranges that
are commonly cached, but not always - say, 1MB and 8MB.
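
To make the proposal concrete, here is a rough sketch in C of the four
size classes and the 4x rule for a reduced Large. The Small, Medium, and
Large defaults are the figures suggested above; the Tiny value (4KB) is
purely my assumption, since no number is proposed for it, and all of the
names are hypothetical.

```c
/* Byte-size helpers; unsigned long long so 1GB does not overflow. */
#define KIB 1024ULL
#define MIB (1024ULL * KIB)

enum dataset { TINY, SMALL, MEDIUM, LARGE };

/* Default dataset size in bytes for each class. */
unsigned long long default_bytes(enum dataset d)
{
    switch (d) {
    case TINY:   return 4 * KIB;     /* assumption: no figure given in the text */
    case SMALL:  return 1 * MIB;     /* suggested: 1MB */
    case MEDIUM: return 8 * MIB;     /* suggested: 8MB */
    case LARGE:  return 1024 * MIB;  /* suggested default: 1GB */
    }
    return 0;
}

/* Returns 1 if a reduced Large size is still at least 4x the largest cache. */
int large_size_allowed(unsigned long long large_bytes,
                       unsigned long long largest_cache_bytes)
{
    return large_bytes >= 4 * largest_cache_bytes;
}
```

For a machine whose largest cache is 4MB, this check would accept a
Large dataset reduced to 16MB and reject one reduced to 8MB.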

> (3) Should STREAM be extended to automagically measure bandwidths and
> latencies across distributed memory systems, such as the Convex cc-NUMA
> machines, and the future SGI "Scalable Node" products? What about
> distributed memory machines without global addressability? (I am not
> interested in an MPI "solution"!!!!!)
> (4) If we are looking at multiprocessor machines with some kind of network
> rather than a shared bus, do we want to look at "peak point-to-point"
> performance, or performance in the presence of contention? How do we
> want to define the communication patterns that will create that contention?
> Can this be done in a way that is not biased for or against any particular
> network architecture? (Unless that bias seems "reasonable"?)

Stream should never lose sight of its original perspective as
"programmer-perceived" memory bandwidth, where the programmer is doing
relatively vanilla high-level language programming. Living with future
memory designs may introduce complexity into the lives of programmers,
and I wonder if Stream can somehow tell you "here's what you get if you
ignore that complexity"? Or maybe the more 'interesting' memory designs
get carried as a separate category.

One final note. Majordomo sometimes obscures the SENDER of messages
(e.g. reports all senders as something like "list-owner"). This may be
the fault of one of the gateways between me and the majordomo server.
But anyway, I would urge that email to this list always include a
signature at the bottom, not relying on the email headers. Thanks.

/John Henning
CSD Performance Group
Digital Equipment Corporation
henning@zko.dec.com
Speaking for myself, not Digital