STREAM results for Power2 "Thin2"

From: ce107@cfm.brown.edu
Date: Mon Apr 22 1996 - 01:53:43 CDT

As usual this is a big one. :-( I thought I had already sent you the
results but today I realised that was not so.
Constantinos

The AIX timer is not very accurate, and given the speed of the machine,
even with vectors 4400000 elements long STREAM estimates that I'll be
getting fewer than 20 clock ticks per test. The tests themselves show
just a bit more than 20, though, so I kept this size, as the 128MB per
node can't really handle much more. However, the results still displayed
too much variability, even when I requested 200 runs (ntimes) instead of 10.
Here are the best cases I got:

For the C compiler:

cc -O3 -qarch=pwr2 -qtune=pwr2 second.c stream_d.c -o stream_O3pwrx -lm
second.c:
stream_d.c:
esp28:ce107/Benchmarking/stream% stream_O3pwrx
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 4400000, Offset = 0
Total memory required = 100.7 MB.
Each test is run 200 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 9999 microseconds.
Each test below will take on the order of 159999 microseconds.
(= 16 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:      306.0870     0.2453     0.2300     0.2700
Scaling   :      293.3333     0.2449     0.2400     0.2600
Summing   :      340.6452     0.3269     0.3100     0.3500
SAXPYing  :      340.6452     0.3272     0.3100     0.3500
230.080u 0.170s 3:51.69 99.3% 15+-83638k 0+0io 4pf+0w

Another run:
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:      306.0870     0.2446     0.2300     0.2700
Scaling   :      306.0870     0.2445     0.2300     0.2700
Summing   :      340.6452     0.3268     0.3100     0.3500
SAXPYing  :      330.0000     0.3278     0.3200     0.3500
230.010u 0.160s 3:51.59 99.3% 15+-83709k 0+0io 5pf+0w

Yet another run:
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:      306.0870     0.2455     0.2300     0.2700
Scaling   :      306.0870     0.2437     0.2300     0.2700
Summing   :      352.0000     0.3273     0.3000     0.3500
SAXPYing  :      340.6452     0.3272     0.3100     0.3500
229.980u 0.140s 3:51.58 99.3% 15+-83740k 0+0io 4pf+0w

One last run:
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:      320.0000     0.2448     0.2200     0.2700
Scaling   :      293.3333     0.2447     0.2400     0.2600
Summing   :      340.6452     0.3270     0.3100     0.3500
SAXPYing  :      340.6452     0.3281     0.3100     0.3500
230.130u 0.170s 3:51.66 99.4% 15+-83601k 0+0io 4pf+0w
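
The handful of distinct rates in the C runs above (293.3333, 306.0870, 320.0000, ...) look like an artifact of the 10 ms timer rather than real differences. Here is a minimal sketch of why, assuming the rate is computed as bytes moved divided by the best (minimum) time, with Assignment and Scaling touching two arrays per iteration and Summing and SAXPYing touching three. The names below are illustrative and not taken from the STREAM source.

```python
# Sketch: how a 10 ms timer quantizes the reported STREAM rates.
# Assumption: rate = (arrays touched * 8 bytes * N) / min_time, in 10^6 bytes/s.
BYTES_PER_WORD = 8
N = 4_400_000          # array size used in the runs above
TICK = 0.01            # timer granularity in seconds (about 10000 microseconds)

# Assumed traffic per kernel: arrays read plus arrays written per iteration.
ARRAYS_TOUCHED = {"Assignment": 2, "Scaling": 2, "Summing": 3, "SAXPYing": 3}


def rate_mb_per_s(kernel, min_time):
    """Rate in MB/s for a kernel given its best time in seconds."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_WORD * N
    return bytes_moved / min_time / 1e6


# Show what one tick more or less does to the reported rate.
for kernel, ticks in [("Assignment", 23), ("Summing", 31)]:
    for t in (ticks - 1, ticks, ticks + 1):
        rate = rate_mb_per_s(kernel, t * TICK)
        print(f"{kernel:11s} {t:2d} ticks -> {rate:9.4f} MB/s")
```

Under these assumptions a single tick at 22 to 24 ticks per test moves the Assignment rate between 320.0, 306.1 and 293.3 MB/s, a step of roughly 4%, which matches the spread between the runs.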

For the Fortran compiler I first used -qtune=pwr2s. In case you've been
working with an older version of the xlf compiler, here is the entry
from the xlf man page:
                          pwr2s - produces an object optimized for a
                                       subset of POWER2 hardware platforms,
                                       the desktop models with narrow memory
                                       bandwidth.
It is not clear whether "narrow" refers to "Thin" or "Thin2" nodes, but
given that the xlf -qtune=pwr2 option was probably designed with the 590
Power2 in mind, it is interesting to see what this option does for the
59H Power2 (an SP2 Thin2 node in our case):

esp28:ce107/Benchmarking/stream% xlf -O3 -qarch=pwr2 -qtune=pwr2s stream_d.f -o stream_O3pwr2s
"stream_d.f", 1500-036 (I) Optimization level 3 has the potential to alter the semantics of a program. Please refer to documentation on -O3 and the STRICT option for more information.
** stream === End of Compilation 1 ===
** second === End of Compilation 2 ===
** realsize === End of Compilation 3 ===
** confuse === End of Compilation 4 ===
** checktick === End of Compilation 5 ===
1501-510 Compilation successful for file stream_d.f.
esp28:ce107/Benchmarking/stream% stream_O3pwr2s
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 4400000
Offset = 0
The total memory requirement is 100 MB
You are running each test 200 times
The *best* time for each test is used
----------------------------------------------------
Your clock granularity/precision appears to be 10000 microseconds
The tests below will each take a time on the order
of 220000 microseconds
(= 22 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
----------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
----------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:      306.0870      .2469      .2300      .2700
Scaling   :      306.0870      .2467      .2300      .2700
Summing   :      330.0000      .3309      .3200      .3500
SAXPYing  :      330.0000      .3312      .3200      .3500
Sum of a is : 0.145456952139311120E+243
Sum of b is : 0.290913904286721377E+242
Sum of c is : 0.387885205746326088E+242
233.470u 0.250s 4:09.15 93.8% 15+-80876k 0+0io 625pf+0w

Another run:
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:      306.0870      .2462      .2300      .2700
Scaling   :      306.0870      .2478      .2300      .2700
Summing   :      340.6452      .3306      .3100      .3500
SAXPYing  :      330.0000      .3308      .3200      .3500
Sum of a is : 0.145456952139311120E+243
Sum of b is : 0.290913904286721377E+242
Sum of c is : 0.387885205746326088E+242
233.200u 0.230s 3:55.11 99.2% 15+-81085k 0+0io 9pf+0w

Yet another run:
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:      306.0870      .2458      .2300      .2700
Scaling   :      293.3333      .2473      .2400      .2700
Summing   :      352.0000      .3313      .3000      .3500
SAXPYing  :      330.0000      .3315      .3200      .3500
Sum of a is : 0.145456952139311120E+243
Sum of b is : 0.290913904286721377E+242
Sum of c is : 0.387885205746326088E+242
233.280u 0.420s 3:55.05 99.4% 15+-80951k 0+0io 4pf+0w

Finally using the Fortran compiler and -qtune=pwr2:

esp28:ce107/Benchmarking/stream% xlf -O3 -qarch=pwr2 -qtune=pwr2 stream_d.f -o stream_O3pwr2
"stream_d.f", 1500-036 (I) Optimization level 3 has the potential to alter the semantics of a program. Please refer to documentation on -O3 and the STRICT option for more information.
** stream === End of Compilation 1 ===
** second === End of Compilation 2 ===
** realsize === End of Compilation 3 ===
** confuse === End of Compilation 4 ===
** checktick === End of Compilation 5 ===
1501-510 Compilation successful for file stream_d.f.
esp28:ce107/Benchmarking/stream% stream_O3pwr2
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 4400000
Offset = 0
The total memory requirement is 100 MB
You are running each test 200 times
The *best* time for each test is used
----------------------------------------------------
Your clock granularity/precision appears to be 10000 microseconds
The tests below will each take a time on the order
of 160000 microseconds
(= 16 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
----------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
----------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:      320.0000      .2448      .2200      .2600
Scaling   :      320.0000      .2315      .2200      .2500
Summing   :      352.0000      .3265      .3000      .3500
SAXPYing  :      340.6452      .3279      .3100      .3500
Sum of a is : 0.145456952139311120E+243
Sum of b is : 0.290913904286721377E+242
Sum of c is : 0.387885205746326088E+242
228.190u 0.280s 3:49.95 99.3% 18+-85102k 0+0io 9pf+0w

Another run:
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:      306.0870      .2449      .2300      .2700
Scaling   :      320.0000      .2322      .2200      .2500
Summing   :      340.6452      .3271      .3100      .3500
SAXPYing  :      352.0000      .3278      .3000      .3500
Sum of a is : 0.145456952139311120E+243
Sum of b is : 0.290913904286721377E+242
Sum of c is : 0.387885205746326088E+242
228.420u 0.260s 3:50.01 99.4% 19+-84913k 0+0io 8pf+0w

Yet another run:
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:      293.3333      .2452      .2400      .2600
Scaling   :      320.0000      .2315      .2200      .2500
Summing   :      352.0000      .3269      .3000      .3500
SAXPYing  :      340.6452      .3279      .3100      .3500
Sum of a is : 0.145456952139311120E+243
Sum of b is : 0.290913904286721377E+242
Sum of c is : 0.387885205746326088E+242
228.300u 0.380s 3:50.03 99.4% 20+-84967k 0+0io 8pf+0w
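
To compare the two Fortran tuning options at a glance, one can take the best rate per kernel across the three runs of each. The sketch below simply does that with the numbers reported above; the dictionary layout and names are illustrative. Since the differences amount to about one 10 ms timer tick, I would read the result as suggestive at most.

```python
# Rates in MB/s copied from the six Fortran runs above (CPU timer, N = 4400000).
KERNELS = ["Assignment", "Scaling", "Summing", "SAXPYing"]

RUNS = {
    "pwr2s": [
        [306.0870, 306.0870, 330.0000, 330.0000],
        [306.0870, 306.0870, 340.6452, 330.0000],
        [306.0870, 293.3333, 352.0000, 330.0000],
    ],
    "pwr2": [
        [320.0000, 320.0000, 352.0000, 340.6452],
        [306.0870, 320.0000, 340.6452, 352.0000],
        [293.3333, 320.0000, 352.0000, 340.6452],
    ],
}


def best_per_kernel(runs):
    """Best (maximum) rate for each kernel across a list of runs."""
    return [max(column) for column in zip(*runs)]


best = {option: best_per_kernel(runs) for option, runs in RUNS.items()}

print(f"{'Function':11s} {'pwr2s':>10s} {'pwr2':>10s}")
for i, kernel in enumerate(KERNELS):
    print(f"{kernel:11s} {best['pwr2s'][i]:10.4f} {best['pwr2'][i]:10.4f}")
```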

Given that I was running on an unloaded node, I decided to try my luck
with the high resolution timer, even though it underestimates performance
because it measures wallclock time rather than CPU time.

Here are the results for the C and the Fortran compilers. I replaced the
calls to second() with calls to hires() in the C source.
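
For readers who want to see what the granularity check amounts to, here is a minimal sketch in Python rather than in the STREAM sources. It estimates a timer's granularity as the smallest positive step seen between successive readings, and converts an estimated test time into clock ticks. The function names are hypothetical; `time.process_time` and `time.perf_counter` stand in for a CPU-time timer like second() and a wallclock timer like hires(), which is an analogy and not the actual AIX routines.

```python
import time


def granularity(timer, samples=20):
    """Smallest positive step observed between successive timer readings."""
    steps = []
    last = timer()
    while len(steps) < samples:
        now = timer()
        if now > last:          # the timer has advanced by at least one tick
            steps.append(now - last)
            last = now
    return min(steps)


def ticks_per_test(test_time_us, granularity_us):
    """Whole clock ticks that fit into one test of the given duration."""
    return int(test_time_us / granularity_us)


cpu_tick = granularity(time.process_time)    # CPU time, analogous to second()
wall_tick = granularity(time.perf_counter)   # wallclock, analogous to hires()
print(f"CPU timer step:       {cpu_tick * 1e6:.3f} microseconds")
print(f"wallclock timer step: {wall_tick * 1e6:.3f} microseconds")

# The estimates printed by the runs in this message:
print(ticks_per_test(159999, 9999))   # C, CPU timer
print(ticks_per_test(220000, 10000))  # Fortran pwr2s, CPU timer
print(ticks_per_test(73586, 1))       # C, high resolution timer
```

With a 1 microsecond granularity the 20-tick guideline is met by a wide margin, which is why much smaller arrays and only 10 repetitions become usable in the runs that follow.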

esp28:ce107/Benchmarking/stream% xlc -O3 -qarch=pwrx -qtune=pwrx stream_d.c timer.o hires.o -o stream_xlcO3pwrx -lm

esp28:ce107/Benchmarking/stream% time stream_xlcO3pwrx
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 73586 microseconds.
(= 73586 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:      288.7229     0.1120     0.1108     0.1205
Scaling   :      288.8919     0.1118     0.1108     0.1202
Summing   :      323.7298     0.1504     0.1483     0.1578
SAXPYing  :      323.8840     0.1513     0.1482     0.1673
5.800u 0.010s 0:06.00 96.8% 15+44780k 0+0io 2pf+0w

xlf -O3 -qarch=pwr2 -qtune=pwr2 stream_d.f timer.o hires.o -o stream_xlfO3pwr2
esp28:ce107/Benchmarking/stream% time stream_xlfO3pwr2
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 200000
Offset = 0
The total memory requirement is 4 MB
You are running each test 10 times
The *best* time for each test is used
----------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds
The tests below will each take a time on the order
of 7343 microseconds
(= 7343 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
----------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
----------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:      290.0625      .0111      .0110      .0114
Scaling   :      304.6318      .0105      .0105      .0107
Summing   :      323.0374      .0149      .0149      .0151
SAXPYing  :      323.7074      .0149      .0148      .0150
Sum of a is : 0.230660156249631264E+18
Sum of b is : 46132031249756200.0
Sum of c is : 61509375000000000.0
0.620u 0.000s 0:00.65 95.3% 19+4538k 0+0io 4pf+0w

xlf -O3 -qarch=pwr2 -qtune=pwr2s stream_d.f timer.o hires.o -o stream_xlfO3pwr2s
esp28:ce107/Benchmarking/stream% time stream_xlfO3pwr2s
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
Array size = 200000
Offset = 0
The total memory requirement is 4 MB
You are running each test 10 times
The *best* time for each test is used
----------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds
The tests below will each take a time on the order
of 7398 microseconds
(= 7398 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
----------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
----------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:      288.3705      .0111      .0111      .0112
Scaling   :      285.8251      .0113      .0112      .0118
Summing   :      316.4269      .0165      .0152      .0244
SAXPYing  :      320.1631      .0150      .0150      .0152
Sum of a is : 0.230660156249631264E+18
Sum of b is : 46132031249756200.0
Sum of c is : 61509375000000000.0
0.620u 0.010s 0:00.67 94.0% 15+4544k 0+0io 4pf+0w
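
Note that the runs in this message use three different array sizes (4400000, 2000000 and 200000). The sketch below reproduces the memory figures the programs print, assuming three arrays of 8-byte words and 1 MB = 2^20 bytes, which is what the printed values are consistent with. The function name is illustrative.

```python
def footprint_mb(n, arrays=3, bytes_per_word=8):
    """Total size of the STREAM arrays in MB, taking 1 MB as 2**20 bytes."""
    return arrays * bytes_per_word * n / 2**20


for n in (4_400_000, 2_000_000, 200_000):
    mb = footprint_mb(n)
    # The C version prints one decimal; the Fortran version prints a whole number.
    print(f"N = {n:8d}: {mb:6.1f} MB (whole MB: {int(mb)})")
```

On a 128MB node the 4400000-element case leaves little headroom, while the 200000-element case needs under 5 MB.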

This archive was generated by hypermail 2b29 : Tue Apr 18 2000 - 05:23:05 CDT