STREAM for T3D

From: Charles Grassl (cmg@ferrari.cray.com)
Date: Mon Jan 30 1995 - 16:49:42 CST


Dear John;

Enclosed are updated results for the CRAY T3D. Note that there are two
sets of results: one with assembly language coded SUM and TRIAD and one with
straight Fortran versions.

The TRIAD I used is not SAXPY from libsci. It correctly handles
C <- A + s*B.

 *** STREAM benchmark ***
 Number of iterations: 2
 Size of Arrays: 501 Kwords

                               Bandwidth (Mbyte/s)
 Machine               Copy     Scale       Sum     Triad
 ----------------  --------  --------  --------  --------
*CRAY T3D 256 PEs   98303.7   84229.5   57622.6   56078.7
*CRAY T3D 128 PEs   49132.9   42113.9   28811.1   28032.5
*CRAY T3D  64 PEs   24577.9   21061.6   14405.8   14020.2
*CRAY T3D  32 PEs   12288.6   10530.7    7204.2    7010.7
*CRAY T3D   1 PE      384.5     329.4     225.4     220.1

 CRAY T3D 256 PEs   98316.2   84241.5   47824.1   45248.1
 CRAY T3D 128 PEs   49156.4   42128.2   23912.0   22625.2
 CRAY T3D  64 PEs   24580.7   21064.7   11955.7   11312.9
 CRAY T3D  32 PEs   12290.3   10533.4    5978.3    5656.4
 CRAY T3D   1 PE      384.2     329.2     187.0     176.8

* Uses assembly language SUM and TRIAD

Let me explain the results.

On the T3D PEs (EV 21064, or Alpha, microprocessors from DEC), the
overall bandwidth is generally dominated by the load rate. Store rates
are different from load rates.

     Table 1. MEMORY LOADS. A page refers to one "unit" of memory
     parts, either 4096 bytes or 2048 bytes, depending on which type of
     memory chip is used. A cache line is 32 bytes long.

                               Clocks            Rate
     Condition                 per cache line    (Mbyte/s)
     -----------------------------------------------------
     Read ahead, page hit            15             320
                 page hit            21             228
                 page miss           37             130

     Table 2. MEMORY STORES. A stream is a store of contiguous memory
     elements. The EV 21064 has four write "silos" or buffers. The
     algorithm for handling these buffers is characterized by the data
     given.

                            Clocks        Rate
     Number of streams      per word      (Mbyte/s)
     ----------------------------------------------
             1                 2.6           462
             2                 7.2           167
             3                 7.9           152
             4                29              42
             5                29              42
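
As a rough cross-check of Tables 1 and 2, the sketch below converts the
clock counts into Mbyte/s. It assumes a 150 MHz PE clock and 8-byte words.
Neither figure is stated in this message, so treat both as assumptions.
Under them the computed rates agree with the tabulated ones to within
about 1 Mbyte/s.

```python
# Assumptions (not stated in the message): 150 MHz clock, 8-byte words.
CLOCK_MHZ = 150.0
WORD_BYTES = 8
CACHE_LINE_BYTES = 32  # from Table 1


def rate_mbyte_per_s(bytes_moved, clocks):
    """Bandwidth in Mbyte/s when `bytes_moved` bytes take `clocks` clocks."""
    return bytes_moved * CLOCK_MHZ / clocks


# Clocks per cache line (Table 1) and clocks per word (Table 2).
load_clocks = {"read ahead, page hit": 15, "page hit": 21, "page miss": 37}
store_clocks = {1: 2.6, 2: 7.2, 3: 7.9, 4: 29, 5: 29}

load_rates = {k: rate_mbyte_per_s(CACHE_LINE_BYTES, c)
              for k, c in load_clocks.items()}
store_rates = {k: rate_mbyte_per_s(WORD_BYTES, c)
               for k, c in store_clocks.items()}

for condition, rate in load_rates.items():
    print(f"load  {condition:22s} {rate:7.1f} Mbyte/s")
for streams, rate in store_rates.items():
    print(f"store {streams} stream(s)            {rate:7.1f} Mbyte/s")
```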

For each of the STREAM tests, Copy, Scale, Sum and Triad, we have a
combination of loads and stores. If only one load is needed, then the
compiler and loader can implement and utilize a "read ahead" mode. The
current compiler and loader do not do this for two loads, as is
required in Sum and Triad.

For Copy, there is one load and one store. The bandwidth is a linear
combination of 320 Mbyte/s (load with read ahead) and 462 Mbyte/s
(store with one stream).
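
One way to read "linear combination" is that the load time and the store
time simply add, with no overlap between them. That reading is an
assumption, not something stated above, and the function name is only
illustrative. Under it, the Copy estimate lands within about 2% of the
measured single-PE figure of 384.5 Mbyte/s.

```python
def combined_bandwidth(load_bytes, load_rate, store_bytes, store_rate):
    """Mbyte/s for one loop iteration, assuming load and store times add.

    Rates are in Mbyte/s; the no-overlap model is an assumption.
    """
    time = load_bytes / load_rate + store_bytes / store_rate
    return (load_bytes + store_bytes) / time


# Copy: one 8-byte load (read ahead, 320) and one 8-byte store (1 stream, 462).
copy_estimate = combined_bandwidth(8, 320.0, 8, 462.0)
print(f"Copy estimate: {copy_estimate:.1f} Mbyte/s (measured: 384.5)")
```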

Scale is approximately the same as Copy, but with a small "noise" from
using one cycle for a floating multiply.

For Sum and Triad, there are two loads and one store. The bandwidth is
"some" linear combination of 228 Mbyte/s ("no read ahead", but with a
page hit) and 462 Mbyte/s for one store. The arithmetic does not work
so well for the "two load" cases, but the speed is dominated by the
load rate for no read ahead (or two inputs).
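
To illustrate how far off the arithmetic is, the sketch below applies a
simple additive-time model (load time plus store time, no overlap) to
the two-load case. The model and the helper name are assumptions made
for illustration. It uses 228 Mbyte/s for loads and the one-stream store
rate of 462 Mbyte/s from Table 2, and compares against the measured
single-PE Sum results.

```python
def combined_bandwidth(load_bytes, load_rate, store_bytes, store_rate):
    """Mbyte/s assuming load and store times add (an assumed model)."""
    time = load_bytes / load_rate + store_bytes / store_rate
    return (load_bytes + store_bytes) / time


# Sum and Triad: two 8-byte loads (no read ahead, page hit) and one store.
two_load_estimate = combined_bandwidth(16, 228.0, 8, 462.0)

measured_sum = {"assembly": 225.4, "Fortran": 187.0}  # 1 PE, from the table
print(f"Two-load estimate: {two_load_estimate:.1f} Mbyte/s")
for version, value in measured_sum.items():
    ratio = two_load_estimate / value
    print(f"  measured Sum ({version}): {value} Mbyte/s, "
          f"estimate is {ratio:.2f}x higher")
```

The estimate overshoots both measured values, which is consistent with
the remark that the simple arithmetic does not hold up for two loads.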

John, I very much appreciate your explanation of the Power Challenge
bandwidth. I had previously thought that the SGI results in your
tables were incorrect.

You are correct in specifying the "burst" bandwidth, which is different
from sustainable bandwidth. For example, the "burst" bandwidth on a
T3D node is as high as 1200 Mbyte/s, but this only lasts for one clock
period. Sustainable, and usable, bandwidth depends on using cache
lines. The sustained bandwidth further depends on the code constructs used.

The concepts of "burst" and "sustainable" also apply to floating point
units, both for RISC and for vector architectures. For example, on a
Y-MP, the burst rate for a floating point multiply or a floating point
add is 1 per clock, or 167 Mflop/s. But if we want the rate for 65 or
more results, then the rate is 64/66 results per clock period. The
64/66 ratio is attributable to the functional units being reserved for
2 extra clock periods. (I believe that the sustainable rate for the
J90 is 64/65 results per clock period.)
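
The burst versus sustained arithmetic can be sketched as below. It
assumes a 64-element vector and takes the 2 reserved clock periods and
the 167 Mflop/s burst rate from the text. The function name is only
illustrative.

```python
def sustained_fraction(vector_length, reserved_clocks):
    """Results per clock when each vector costs extra reserved clocks."""
    return vector_length / (vector_length + reserved_clocks)


YMP_BURST_MFLOPS = 167.0  # 1 result per clock, from the text

ymp_fraction = sustained_fraction(64, 2)  # 64/66
j90_fraction = sustained_fraction(64, 1)  # 64/65, as believed in the text

ymp_sustained = YMP_BURST_MFLOPS * ymp_fraction
print(f"Y-MP: {ymp_fraction:.4f} results/clock, "
      f"{ymp_sustained:.1f} Mflop/s sustained")
print(f"J90:  {j90_fraction:.4f} results/clock")
```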

John, do you know what is the "theoretical" sustainable bandwidth for
an RS/6000 590? Subhas Sani, from NASA Ames, quotes it as 2100 Mbyte/s
(66 MHz * 4 words * 8 bytes). But the entry in your table indicates 650
Mbyte/s. This is far from 2100 Mbyte/s, so I suspect that the 2100
number is a burst speed. Also, it could be the rate which includes
rewriting cache.
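
For reference, the quoted formula works out as follows. This only
restates the numbers above and shows what fraction of the quoted figure
the table entry represents. The variable names are illustrative.

```python
CLOCK_MHZ = 66          # from the quoted formula
WORDS_PER_CLOCK = 4     # from the quoted formula
BYTES_PER_WORD = 8      # from the quoted formula

quoted_peak = CLOCK_MHZ * WORDS_PER_CLOCK * BYTES_PER_WORD  # Mbyte/s
table_value = 650                                           # Mbyte/s

fraction = table_value / quoted_peak
print(f"Quoted figure: {quoted_peak} Mbyte/s")
print(f"Table entry is {fraction:.0%} of the quoted figure")
```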

Regards,
Charles Grassl
Cray Research, Inc.
cmg@cray.com
(612) 683-3531


