**Next message:** Norbert Juffa: "STREAM results Pentium, 180 MHz"
**Previous message:** Charles Grassl: "CRAY J932 stream results"
**Next in thread:** jerry (j.) quinn: "STREAM results"

Hi John,

I finally took the time to run your STREAM benchmark (Fortran version)
through our compiler and do some simulations. I had to make one
change to the code. You call the function second() everywhere,
passing dummy. Our compiler says, "Well, it's a function, so it might
change things like common variables and such, but I don't see any of
that here, so I'm going to rearrange things to suit myself." And it
takes your big loop

      DO 70 k = 1,ntimes
         t = second(dummy)
         DO 30 j = 1,n
            c(j) = a(j)
   30    CONTINUE
         t = second(dummy) - t
         times(1,k) = t
         t = second(dummy)
         DO 40 j = 1,n
            b(j) = scalar*c(j)
   40    CONTINUE
         t = second(dummy) - t
         times(2,k) = t
         t = second(dummy)
         DO 50 j = 1,n
            c(j) = a(j) + b(j)
   50    CONTINUE
         t = second(dummy) - t
         times(3,k) = t
         t = second(dummy)
         DO 60 j = 1,n
            a(j) = b(j) + scalar*c(j)
   60    CONTINUE
         t = second(dummy) - t
         times(4,k) = t
   70 CONTINUE

and rewrites it as

      do k = 1, ntimes
         t = second(dummy)
         t = second(dummy) - t
         times(1, k) = t
         t = second(dummy)
         t = second(dummy) - t
         times(2, k) = t
         t = second(dummy)
         t = second(dummy) - t
         times(3, k) = t
         t = second(dummy)
         t = second(dummy) - t
         times(4, k) = t
      enddo

      doall j = 1, n
         do k = 1, ntimes
            c(j) = a(j)
            b(j) = scalar * c(j)
            c(j) = a(j) + b(j)
            a(j) = b(j) + scalar * c(j)
         enddo
      enddo

which isn't exactly what you intended! Adding insult to injury, the
regular optimizer gets hold of the second loop nest and hoists all the
loads and stores out of the inner loop. Anyway, I changed all the
invocations of second(dummy) to things like second(a), second(b), and
so forth, so the compiler has to assume the timer might touch the
arrays and leaves the loops where they are.

Since we still don't have a machine (damn!), I've run things on our
simulator. It's supposed to model everything accurately, though,
including packets hopping around the network, any collisions or
hotspots, etc. I'm assuming a clock rate of 300 MHz. We won't know
that until the hardware runs, but we're actually hoping for better
(say 333 MHz or so).

I did several runs, experimenting with various values of "n" and
different sizes of system. To get you warmed up, I've included
results for a one-processor system. I'll get some more to you soon.

Looking at the code, each of the loops should peak out at 2400 MB/s
per processor. We're a little slower because of scalar overheads
(time to get all the parallel threads up and chugging). If the
simulator weren't so slow, I expect I could scale "n" to bring the
achieved rate arbitrarily close to the asymptotic rate.
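For anyone checking the arithmetic, 2400 MB/s per processor is just the assumed 300 MHz clock times one 8-byte word per clock; the one-memory-reference-per-clock figure is my inference from the stated peak, not something spelled out above:

```python
# Back-of-the-envelope peak bandwidth. The refs-per-clock value is an
# assumption inferred from the 2400 MB/s figure in the message.
clock_hz = 300e6        # assumed clock rate from the message
bytes_per_ref = 8       # one DOUBLE PRECISION word
refs_per_clock = 1      # assumed: one memory reference per clock
peak_mb_s = clock_hz * bytes_per_ref * refs_per_clock / 1e6
print(peak_mb_s)
```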

----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 200000
Offset     = 0
The total memory requirement is 4 MB
You are running each test 2 times
The *best* time for each test is used
----------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds
The tests below will each take a time on the order
of 1457 microseconds (= 729 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
----------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
----------------------------------------------------
Function     Rate (MB/s)   RMS time   Min time   Max time
Assignment:    2133.7553     0.0015     0.0015     0.0015
Scaling   :    2190.8705     0.0015     0.0015     0.0015
Summing   :    2202.6162     0.0022     0.0022     0.0022
SAXPYing  :    2199.1884     0.0022     0.0022     0.0022
Sum of a is : 90000000.
Sum of b is : 18000000.
Sum of c is : 24000000.
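As a sanity check on the output above: STREAM's rates are just the bytes each kernel moves per element (16 for assignment and scaling, which read one array and write another; 24 for summing and SAXPYing, which read two and write one) times n, divided by the minimum time. A sketch, using the rounded min times printed above, so the recomputed rates land near, not exactly on, the reported ones:

```python
# Recompute STREAM rates from bytes moved per kernel and the printed
# (rounded to 4 decimals) minimum times; the small drift from the
# reported rates is just that rounding.
n = 200_000
bytes_per_elem = {"Assignment": 16, "Scaling": 16,   # one read + one write
                  "Summing": 24, "SAXPYing": 24}     # two reads + one write
min_time = {"Assignment": 0.0015, "Scaling": 0.0015,
            "Summing": 0.0022, "SAXPYing": 0.0022}
rate = {k: bytes_per_elem[k] * n / min_time[k] / 1e6 for k in bytes_per_elem}
for k, v in rate.items():
    print(f"{k:11s} {v:9.1f} MB/s")
```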

The above is for one processor, compiled especially to run on one
processor. For a more general version capable of running on many
processors, the overhead is a little higher. I did a few tests with
larger machines and the same work, but they made it look like the
machine scaled horribly. I'm redoing them now, slowly and painfully,
doubling the amount of work each time the number of processors
doubles. Here are my predictions. We'll see what actually happens.
I'm not actually sure we'll be able to simulate the larger cases, but
it hasn't crashed yet.

    procs    copy (MB/s)          sum (MB/s)            n
      1       2,062   x1           2,090   x1         200,000
      2       3,969   x1.92        4,026   x1.92      400,000
      4       7,640   x3.70        7,740   x3.70      800,000
      8      14,715   x7.13       14,964   x7.16    1,600,000
     16      27,991  x13.57       28,993  x13.87    3,200,000
     32      50,648  x24.56       53,504  x25.6     6,400,000
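The scaling columns are each rate divided by the one-processor rate; recomputing from the (rounded) copy rates reproduces the table to within the last digit:

```python
# Speedup = rate on p processors / rate on 1 processor, using the
# rounded copy rates from the table above.
procs = [1, 2, 4, 8, 16, 32]
copy_rate = [2062, 3969, 7640, 14715, 27991, 50648]
speedup = [r / copy_rate[0] for r in copy_rate]
for p, s in zip(procs, speedup):
    print(f"{p:3d}  x{s:.2f}")
```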

What do people do on the other parallel machines?

Do you care how big they make n?

Preston


*This archive was generated by hypermail 2b29 : Tue Apr 18 2000 - 05:23:05 CDT*