STREAM results

From: Preston Briggs (preston@tera.com)
Date: Sun Mar 31 1996 - 02:30:30 CST


Hi John,

I finally took the time to run your STREAM benchmark (Fortran version)
through our compiler and do some simulations. I had to make one
change to the code. You call the function second() everywhere,
passing dummy. Our compiler says, "Well, it's a function so it might
change things like common variables and such, but I don't see any of
that here, so I'm going to rearrange things to suit myself." And it
takes your big loop

      DO 70 k = 1,ntimes
          t = second(dummy)
          DO 30 j = 1,n
              c(j) = a(j)
   30 CONTINUE
          t = second(dummy) - t
          times(1,k) = t

          t = second(dummy)
          DO 40 j = 1,n
              b(j) = scalar*c(j)
   40 CONTINUE
          t = second(dummy) - t
          times(2,k) = t

          t = second(dummy)
          DO 50 j = 1,n
              c(j) = a(j) + b(j)
   50 CONTINUE
          t = second(dummy) - t
          times(3,k) = t

          t = second(dummy)
          DO 60 j = 1,n
              a(j) = b(j) + scalar*c(j)
   60 CONTINUE
          t = second(dummy) - t
          times(4,k) = t
   70 CONTINUE

and rewrites it as

      do k = 1, ntimes
        t = second(dummy)
        t = second(dummy) - t
        times(1, k) = t
        t = second(dummy)
        t = second(dummy) - t
        times(2, k) = t
        t = second(dummy)
        t = second(dummy) - t
        times(3, k) = t
        t = second(dummy)
        t = second(dummy) - t
        times(4, k) = t
      enddo
      doall j = 1, n
        do k = 1, ntimes
          c(j) = a(j)
          b(j) = scalar * c(j)
          c(j) = a(j) + b(j)
          a(j) = b(j) + scalar * c(j)
        enddo
      enddo

which isn't exactly what you intended! Adding insult to injury, the
regular optimizer gets hold of the second loop nest and hoists all the
loads and stores out of the inner loop. Anyway, I changed all the
invocations of second(dummy) to things like second(a), second(b), and
so forth.
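
A small sketch of why the rearrangement is legal as far as the arrays go. It is written in Python rather than Fortran purely for illustration, and the function names and test values are made up for the example. Every statement in the four kernels touches only element j, so running all ntimes repetitions for one j before moving to the next leaves the arrays exactly as the original order does. Only the timing is destroyed.

```python
# Illustrative sketch: the four STREAM kernels in the original loop order
# versus the interchanged order. Names and inputs here are hypothetical.

def original_order(a, b, c, scalar, ntimes):
    """Outer loop over repetitions, one full sweep over j per kernel."""
    a, b, c = list(a), list(b), list(c)
    n = len(a)
    for _ in range(ntimes):
        for j in range(n):
            c[j] = a[j]                    # copy
        for j in range(n):
            b[j] = scalar * c[j]           # scale
        for j in range(n):
            c[j] = a[j] + b[j]             # sum
        for j in range(n):
            a[j] = b[j] + scalar * c[j]    # triad
    return a, b, c


def interchanged_order(a, b, c, scalar, ntimes):
    """Outer loop over j (the 'doall'), all repetitions done per element."""
    a, b, c = list(a), list(b), list(c)
    for j in range(len(a)):
        for _ in range(ntimes):
            c[j] = a[j]
            b[j] = scalar * c[j]
            c[j] = a[j] + b[j]
            a[j] = b[j] + scalar * c[j]
    return a, b, c


if __name__ == "__main__":
    start = ([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], [0.0, 0.0, 0.0])
    print(original_order(*start, 3.0, 2) == interchanged_order(*start, 3.0, 2))
```

Both functions perform the same floating-point operations in the same order for each element, so the comparison is exact rather than approximate.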

Since we still don't have a machine (damn!), I've run things on our
simulator. It's supposed to accurately model everything though,
including packets hopping around the network, any collisions or
hotspots, etc. I'm assuming a clock rate of 300 MHz. Won't know that
until the hardware runs, but we're actually hoping for better (say
333 MHz or so).

I did several runs, experimenting with various values of "n" and
different sizes of system. To get you warmed up, I've included
results for a one processor system. I'll get some more to you soon.

Looking at the code, each of the loops should peak out at 2400 MB/s
per processor. We're a little slower because of scalar overheads
(time to get all the parallel threads up and chugging). If the
simulator wasn't so slow, I expect I could scale "n" to bring the
achieved rate arbitrarily close to the asymptotic rate.
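
The 2400 MB/s figure can be reproduced with a little arithmetic. The sketch below is in Python for illustration and rests on one assumption that the message does not state: that each processor moves one 8-byte word per clock. That assumption is chosen because it reproduces 2400 MB/s at the assumed 300 MHz clock. The constant and function names are hypothetical.

```python
# Rough check of the per-processor peak rate quoted above.
BYTES_PER_WORD = 8      # from the STREAM output: 8 bytes per DOUBLE PRECISION word
WORDS_PER_CLOCK = 1     # ASSUMPTION: one memory reference per clock per processor


def peak_mb_per_s(clock_hz):
    """Peak memory rate in MB/s (1 MB = 10**6 bytes) for one processor."""
    return clock_hz * WORDS_PER_CLOCK * BYTES_PER_WORD / 1e6


def fraction_of_peak(rate_mb_per_s, clock_hz=300e6):
    """Achieved rate as a fraction of the peak at the given clock."""
    return rate_mb_per_s / peak_mb_per_s(clock_hz)


if __name__ == "__main__":
    print(peak_mb_per_s(300e6))   # assumed clock
    print(peak_mb_per_s(333e6))   # hoped-for clock
```

Under the same assumption, the hoped-for 333 MHz clock would put the peak at 2664 MB/s per processor.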

----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 200000
 Offset = 0
 The total memory requirement is 4 MB
 You are running each test 2 times
 The *best* time for each test is used
 ----------------------------------------------------
 Your clock granularity/precision appears to be 2 microseconds
 The tests below will each take a time on the order
 of 1457 microseconds
    (= 729 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 ----------------------------------------------------
 WARNING: The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 ----------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:    2133.7553     0.0015     0.0015     0.0015
Scaling   :    2190.8705     0.0015     0.0015     0.0015
Summing   :    2202.6162     0.0022     0.0022     0.0022
SAXPYing  :    2199.1884     0.0022     0.0022     0.0022
 Sum of a is : 90000000.
 Sum of b is : 18000000.
 Sum of c is : 24000000.
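
The numbers in this output can be cross-checked against each other. The sketch below (Python, for illustration only) makes two assumptions that are not stated in the message: that every element of a starts at 2.0 with scalar = 3.0, and that STREAM counts 1 MB as 10**6 bytes, with two words moved per element for copy and scale and three for sum and triad. The starting values are picked because they reproduce the three reported sums after two repetitions. The starting values of b and c do not matter, since both are overwritten before they are read.

```python
# Consistency check of the one-processor STREAM output.
N = 200_000          # array size from the output
NTIMES = 2           # "running each test 2 times"
BYTES_PER_WORD = 8

# ASSUMED starting values (not given in the message).
a, b, c, scalar = 2.0, 0.0, 0.0, 3.0
for _ in range(NTIMES):
    c = a                    # copy
    b = scalar * c           # scale
    c = a + b                # sum
    a = b + scalar * c       # triad

# Every element follows the same recurrence, so each sum is N times one element.
sums = (a * N, b * N, c * N)


def expected_time(words_per_element, rate_mb_per_s):
    """Seconds for one sweep at the reported rate, taking 1 MB = 10**6 bytes."""
    return words_per_element * BYTES_PER_WORD * N / (rate_mb_per_s * 1e6)


if __name__ == "__main__":
    print(sums)
    print(round(expected_time(2, 2133.7553), 4))   # Assignment
    print(round(expected_time(3, 2202.6162), 4))   # Summing
```

With these assumptions the sums come out to 90000000, 18000000 and 24000000, and the times round to the 0.0015 and 0.0022 seconds shown in the table.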

The above is for one processor, compiled especially to run on one
processor. For a more general version capable of running on many
processors, the overhead is a little higher. I did a few tests with
larger machines and the same work, but they made it look like the
machine scaled horribly. I'm redoing them now, slowly and painfully,
doubling the amount of work each time the number of processors doubles.

Here are my predictions. We'll see what actually happens. I'm not
actually sure we'll be able to simulate the larger cases, but it
hasn't crashed yet.

procs   copy (MB/s)   speedup   sum (MB/s)   speedup           n

    1       2,062     x1            2,090    x1          200,000
    2       3,969     x1.92         4,026    x1.92       400,000
    4       7,640     x3.70         7,740    x3.70       800,000
    8      14,715     x7.13        14,964    x7.16     1,600,000
   16      27,991     x13.57       28,993    x13.87    3,200,000
   32      50,648     x24.56       53,504    x25.6     6,400,000
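
The scaling factors in the table are each rate divided by the one-processor rate. The sketch below (Python, for illustration only) recomputes them and adds a parallel efficiency, speedup divided by processor count, which is a derived figure not in the original table. The names are hypothetical. The recomputed factors agree with the listed ones to within about 0.01.

```python
# Recompute the speedups in the predictions table and derive efficiency.
# Columns: processors, copy rate, listed copy factor, sum rate, listed sum factor, n
PREDICTIONS = [
    (1, 2062, 1.00, 2090, 1.00, 200_000),
    (2, 3969, 1.92, 4026, 1.92, 400_000),
    (4, 7640, 3.70, 7740, 3.70, 800_000),
    (8, 14715, 7.13, 14964, 7.16, 1_600_000),
    (16, 27991, 13.57, 28993, 13.87, 3_200_000),
    (32, 50648, 24.56, 53504, 25.6, 6_400_000),
]


def scaling(rows):
    """Return (procs, copy speedup, copy efficiency, sum speedup, sum efficiency, n per proc)."""
    base_copy, base_sum = rows[0][1], rows[0][3]
    out = []
    for procs, copy_rate, _, sum_rate, _, n in rows:
        copy_x = copy_rate / base_copy
        sum_x = sum_rate / base_sum
        out.append((procs, copy_x, copy_x / procs, sum_x, sum_x / procs, n // procs))
    return out


if __name__ == "__main__":
    for procs, copy_x, copy_eff, sum_x, sum_eff, n_per_proc in scaling(PREDICTIONS):
        print(f"{procs:2d}  copy x{copy_x:5.2f} ({copy_eff:4.0%})  "
              f"sum x{sum_x:5.2f} ({sum_eff:4.0%})  n/proc = {n_per_proc}")
```

The work per processor stays at 200,000 elements in every row, so this is a scaled-size comparison. At 32 processors the predicted efficiency is about 77% for copy and 80% for sum.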

What do people do on the other parallel machines?
Do you care how big they make n?

Preston


