STREAM on Power3smp

From: Frank Johnston (fjohn@us.ibm.com)
Date: Mon Sep 27 1999 - 11:16:01 CDT


To whom it may concern,

I would like to submit STREAM benchmark results for an
IBM RS/6000 SP 222 MHz POWER3smp. Please add the following
summary to the web page of standard STREAM results:

Machine ID                ncpus    COPY   SCALE     ADD   TRIAD
IBM-SP_222MHz-POWER3smp       1   790.4   794.1   951.0   953.3
IBM-SP_222MHz-POWER3smp       2  1549.2  1524.4  1843.5  1844.5
IBM-SP_222MHz-POWER3smp       4  2677.0  2630.9  3091.3  3101.2
IBM-SP_222MHz-POWER3smp       8  4332.9  4211.4  4843.3  4937.8
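
The scaling implied by this summary can be worked out directly from the
numbers. The following is a minimal Python sketch, not part of the submitted
benchmark; the names RATES, KERNELS and scaling are illustrative only. It
divides each rate by the 1-CPU rate for the same kernel to get a speedup,
and divides that by the CPU count to get a parallel efficiency.

```python
# Rates in MB/s copied from the summary table: ncpus -> (COPY, SCALE, ADD, TRIAD).
RATES = {
    1: (790.4, 794.1, 951.0, 953.3),
    2: (1549.2, 1524.4, 1843.5, 1844.5),
    4: (2677.0, 2630.9, 3091.3, 3101.2),
    8: (4332.9, 4211.4, 4843.3, 4937.8),
}
KERNELS = ("COPY", "SCALE", "ADD", "TRIAD")


def scaling(rates):
    """Return ncpus -> list of (speedup, efficiency), one pair per kernel."""
    base = rates[1]
    result = {}
    for ncpus, row in rates.items():
        pairs = []
        for rate, base_rate in zip(row, base):
            speedup = rate / base_rate
            pairs.append((speedup, speedup / ncpus))
        result[ncpus] = pairs
    return result


for ncpus, pairs in sorted(scaling(RATES).items()):
    for kernel, (speedup, efficiency) in zip(KERNELS, pairs):
        print(f"{ncpus} CPUs {kernel:5s} speedup {speedup:5.2f} "
              f"efficiency {efficiency:4.2f}")
```

For TRIAD this gives a speedup of about 5.18 on 8 CPUs, or roughly 65%
parallel efficiency.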

Each CPU on this machine has a private 4 MB L2 cache. The array
size at each CPU count exceeds the aggregate available L2. However,
the arrays were kept as small as possible so that the system
tables mapping virtual to real memory have a chance to reside in
the L2. When entries in these tables must be fetched from main memory,
they generate additional memory traffic that STREAM does not count.
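
The relation between array size and aggregate L2 can be checked from the
array sizes printed in the outputs at the end of this message. This is a
small Python sketch with illustrative names; it assumes that "4 MB" means
4 * 2**20 bytes and that each element is an 8-byte DOUBLE PRECISION word.

```python
L2_BYTES_PER_CPU = 4 * 2**20  # assumption: "4 MB" means 4 * 2**20 bytes
BYTES_PER_WORD = 8            # "Assuming 8 bytes per DOUBLE PRECISION word"
N_ARRAYS = 3                  # STREAM works on three arrays: a, b and c

# ncpus -> "Array size" as printed in the run outputs
ARRAY_SIZE = {1: 524300, 2: 1048600, 4: 2097200, 8: 4194400}


def excess_over_l2(ncpus, n):
    """Bytes by which ONE array exceeds the aggregate L2 of ncpus CPUs."""
    return n * BYTES_PER_WORD - ncpus * L2_BYTES_PER_CPU


def total_mb(n):
    """Footprint of all three arrays in units of 2**20 bytes."""
    return N_ARRAYS * n * BYTES_PER_WORD / 2**20


for ncpus, n in sorted(ARRAY_SIZE.items()):
    print(f"{ncpus} CPUs: one array exceeds aggregate L2 by "
          f"{excess_over_l2(ncpus, n)} bytes, total {total_mb(n):.2f} MB")
```

Under these assumptions each array is only 96 bytes per CPU larger than the
aggregate L2, and the three arrays together occupy about three times the
aggregate L2 (12, 24, 48 and 96 MB).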

In addition to changing the array size and offset, the following
changes were made to the Fortran STREAM code:

1) Increase "ntimes" to 1000 in order to reduce run-to-run fluctuations.

2) Change the "scalar" constant to sqrt(2) - 1 to prevent overflows (INFs).

3) Add compiler directives to allow the code to run in parallel on multiple
   CPUs and to allow certain arrays to be "prefetched" by the hardware.
   Here is an example (the COPY loop):

!smp$ parallel do private(j) , schedule(static)
          DO 30 j = 1,n
!ibm* prefetch_by_load( c(j) )
              c(j) = a(j)
   30 CONTINUE
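
The choice of sqrt(2) - 1 in change 2 can be motivated with a short
calculation. The sketch below (Python, illustrative only, not the submitted
Fortran) follows a single array element through repeated passes of the four
kernels in the usual STREAM order: COPY c = a, SCALE b = scalar*c,
ADD c = a + b, TRIAD a = b + scalar*c. One pass multiplies a by
scalar**2 + 2*scalar, which is 15 for a scalar of 3.0 (the value the
standard STREAM source is understood to use) and 1 for sqrt(2) - 1. The
starting value a = 2.0 is an assumption, chosen because the reported
"Sum of a" is twice the array size.

```python
import math


def growth_factor(scalar):
    """Factor by which a is multiplied in one COPY/SCALE/ADD/TRIAD pass."""
    return scalar * scalar + 2.0 * scalar


def element_after(scalar, ntimes, a=2.0):
    """Follow one element through ntimes passes; a=2.0 is an assumption."""
    b = c = 0.0
    for _ in range(ntimes):
        c = a               # COPY
        b = scalar * c      # SCALE
        c = a + b           # ADD
        a = b + scalar * c  # TRIAD
    return a, b, c


print(growth_factor(3.0))                       # 15.0
print(element_after(3.0, 1000))                 # values overflow to inf
print(element_after(math.sqrt(2.0) - 1, 1000))  # a stays close to 2.0
```

With a growth factor of 15 per pass, 1000 passes exceed the range of a
double, while a factor of 1 leaves the values bounded apart from rounding.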

If you have any questions, please let me know.

Frank Johnston
(fjohn@us.ibm.com, xxx-xxx-xxxx)

P.S. Here are the actual outputs:

output for 1 CPU:
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 524300
 Offset = 330
 The total memory requirement is 12 MB
 You are running each test 1000 times
 The *best* time for each test is used
 ----------------------------------------------------
 Your clock granularity/precision appears to be 5 microseconds
 The tests below will each take a time on the order
 of 8259 microseconds
    (= 1652 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 ----------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 ----------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Copy:            790.4472      .0107      .0106      .0116
Scale:           794.1492      .0106      .0106      .0109
Add:             951.0364      .0133      .0132      .0144
Triad:           953.3298      .0133      .0132      .0137
 Sum of a is = 1048600.00000001118
 Sum of b is = 434344.341503158561
 Sum of c is = 1482944.34150735638
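
The rates and minimum times in this output are related by a simple formula:
each kernel moves a fixed number of 8-byte words per array element (2 for
Copy and Scale, 3 for Add and Triad), and the rate is that byte count
divided by the minimum time, with 1 MB taken as 10**6 bytes. The Python
sketch below is illustrative only (the names are not from the STREAM
source); it recomputes the minimum time implied by each reported rate for
the 1-CPU run.

```python
N = 524300          # "Array size" from the 1-CPU output
BYTES_PER_WORD = 8  # "Assuming 8 bytes per DOUBLE PRECISION word"

# Words read plus words written per array element for each kernel.
WORDS_PER_ELEMENT = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# kernel -> (reported rate in MB/s, reported min time in seconds)
REPORTED = {
    "Copy": (790.4472, 0.0106),
    "Scale": (794.1492, 0.0106),
    "Add": (951.0364, 0.0132),
    "Triad": (953.3298, 0.0132),
}


def implied_min_time(kernel, rate_mb_s, n=N):
    """Seconds needed to move the kernel's bytes at the given rate."""
    nbytes = WORDS_PER_ELEMENT[kernel] * BYTES_PER_WORD * n
    return nbytes / (rate_mb_s * 1.0e6)


for kernel, (rate, min_time) in REPORTED.items():
    print(f"{kernel:6s} implied {implied_min_time(kernel, rate):.4f} s, "
          f"reported {min_time:.4f} s")
```

The implied times agree with the printed Min time column to the four
decimals shown.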

output for 2 CPUs:
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 1048600
 Offset = 404
 The total memory requirement is 24 MB
 You are running each test 1000 times
 The *best* time for each test is used
 ----------------------------------------------------
 Your clock granularity/precision appears to be 5 microseconds
 The tests below will each take a time on the order
 of 15856 microseconds
    (= 3171 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 ----------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 ----------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Copy:           1549.2400      .0109      .0108      .0201
Scale:          1524.3722      .0111      .0110      .0120
Add:            1843.4750      .0137      .0137      .0147
Triad:          1844.5380      .0137      .0136      .0143
 Sum of a is = 2097200.00000001118
 Sum of b is = 868688.683004021645
 Sum of c is = 2965888.68302347837

output for 4 CPUs:
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 2097200
 Offset = 260
 The total memory requirement is 48 MB
 You are running each test 1000 times
 The *best* time for each test is used
 ----------------------------------------------------
 Your clock granularity/precision appears to be 5 microseconds
 The tests below will each take a time on the order
 of 32019 microseconds
    (= 6404 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 ----------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 ----------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Copy:           2676.9512      .0133      .0125      .0329
Scale:          2630.9133      .0133      .0128      .0229
Add:            3091.3001      .0169      .0163      .0181
Triad:          3101.2452      .0168      .0162      .0180
 Sum of a is = 4194400.00000001118
 Sum of b is = 1737377.36602994637
 Sum of c is = 5931777.36605572235

output for 8 CPUs:
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 4194400
 Offset = 468
 The total memory requirement is 96 MB
 You are running each test 1000 times
 The *best* time for each test is used
 ----------------------------------------------------
 Your clock granularity/precision appears to be 5 microseconds
 The tests below will each take a time on the order
 of 64196 microseconds
    (= 12839 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 ----------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 ----------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Copy:           4332.9165      .0166      .0155      .0576
Scale:          4211.3665      .0170      .0159      .0588
Add:            4843.2744      .0217      .0208      .0607
Triad:          4937.7796      .0213      .0204      .0486
 Sum of a is = 8388800.00000001118
 Sum of b is = 3474754.73209443502
 Sum of c is = 11863554.7321202103
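
The three checksums printed for each run can be compared against
closed-form values. This Python sketch is illustrative only and rests on two
assumptions: every element of a holds 2.0 (the printed "Sum of a" is twice
the array size), and the kernels leave b = scalar*a and c = a + b with
scalar = sqrt(2) - 1 as described in change 2. Under those assumptions the
sums should be 2*n, 2*(sqrt(2) - 1)*n and 2*sqrt(2)*n.

```python
import math

SCALAR = math.sqrt(2.0) - 1.0  # the constant named in change 2
A_VALUE = 2.0                  # assumption: each element of a equals 2.0

# array size -> (Sum of a, Sum of b, Sum of c) as printed in the four outputs
REPORTED_SUMS = {
    524300: (1048600.00000001118, 434344.341503158561, 1482944.34150735638),
    1048600: (2097200.00000001118, 868688.683004021645, 2965888.68302347837),
    2097200: (4194400.00000001118, 1737377.36602994637, 5931777.36605572235),
    4194400: (8388800.00000001118, 3474754.73209443502, 11863554.7321202103),
}


def expected_sums(n):
    """Closed-form sums of a, b and c under the assumptions stated above."""
    sum_a = A_VALUE * n
    sum_b = SCALAR * sum_a
    return sum_a, sum_b, sum_a + sum_b


def sums_match(n, reported, rel_tol=1e-9):
    """True if all three reported sums agree with the closed form."""
    return all(math.isclose(r, e, rel_tol=rel_tol)
               for r, e in zip(reported, expected_sums(n)))


for n, reported in REPORTED_SUMS.items():
    print(n, sums_match(n, reported))
```

All four runs match to better than one part in 10**9; the small residual is
consistent with rounding accumulated over 1000 repetitions.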



This archive was generated by hypermail 2b29 : Tue Apr 18 2000 - 05:23:08 CDT