email@example.com (John D. McCalpin) -
I have a question on your bandwidth-limited Mflop benchmarks.
Your benchmark measures bandwidth to global memory, without allowing
cache-based processors to get benefits from reusing data in local
memory. I understand that this is an accurate reflection of many large
dense matrix applications.
Yet you apparently allow the CM-2 to access local memory.
Why not measure the saxpy in the a(i) = a(i) + scalar*b(i) format?
This form of saxpy is frequently used, and it would remove the
penalty that cached machines pay for pre-reading a(i). Alternatively, why not
insist that at least one of a, b, or c be on a different processor in
the CM-2? This would seem to be true when using saxpy for matrix
multiply or Gaussian elimination.
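To make the comparison concrete, here is a minimal C sketch of the two loop forms. It assumes the benchmark's current kernel is the three-array form a(i) = b(i) + scalar*c(i), which the mention of a, b, and c suggests; the function names triad and saxpy_update are hypothetical and are not taken from the benchmark source.

```c
#include <stddef.h>

/* Assumed three-array form: a is only written. On a write-allocate
   cache each line of a is still read from memory before the store,
   so the machine moves more data than the loop nominally requires. */
void triad(double *a, const double *b, const double *c,
           double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}

/* Two-array update form proposed above: a is both read and written,
   so the read of a(i) is part of the computation, not pure overhead. */
void saxpy_update(double *a, const double *b, double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + scalar * b[i];
}
```

Both loops do two floating-point operations per iteration; they differ only in which arrays have to be read and written.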
Thank you for your attention and response.
- Bob Greiner, Motorola Computer Group, firstname.lastname@example.org
Personal opinion only