Re: SPP-1600 STREAM results

From: C. Evangelinos (ce107@cfm.brown.edu)
Date: Mon Jun 03 1996 - 11:30:04 CDT


Sorry for the delayed results, but I've had too many other things to do
lately. Thank you for your suggestions: the 1-CPU results for the
SPP-1200 are now slightly better, and I was able to run your code and
get decent parallel results with it. It would still be interesting to
see what the auto-parallelizer can do with STREAM on the SPP-1600,
because it certainly wasn't doing well on the SPP-1200 when I
tested it. I was unable to run on more than one hypernode at NCSA; the
job output from running on the 16-node dedicated_GSM queue
consistently looked like this:

Warning: no access to tty (Bad file number).
Thus no job control in this shell.
login in progress...
[snip]
Interactive Scratch dir is /scr-int/cevangel.
Link /u/ac/cevangel/scr to /scr-int/cevangel exists
Batch Scratch dir is /scr-dedicated_GSM/cevangel.
cd benchmark/stream
mpa par_stream_d.O4 1
 
Queue : dedicated_GSM Host : lena
Started : Mon May 20 19:33:31 1996
Completed : Mon May 20 21:36:12 1996
User : 0.340000 secs System : 2.230000 secs

It seems that the program never actually ran, yet according to the NQS
queue manager it was running for more than 2 hours (hence it was killed).
I have no idea what the reason for this could be.

I tested both +O2 and +O4 optimizations:
stream_d.O2 (serial version compiled with your makefile)
stream_d.O4 (serial version compiled with your makefile adjusted so
             that the optimization flags are +OP4 +OPK +Oall +O4 +DA1.1
             and -lvec is also needed for linking)
par_stream_d.O2 (parallel version compiled with your makefile)
called as:
mpa -DATA par_stream_d.O2 <# of processors>
par_stream_d.O4 (parallel version compiled with your makefile adjusted
                 so that the optimization flags are +OP4 +OPK +Oall +O4 +DA1.1
                 and -lvec is also needed for linking)
called as:
mpa -DATA par_stream_d.O4 <# of processors>

It is interesting that the serial performance difference between the
SPP-1200 and the SPP-1600 diminishes in the parallel version as the
number of processors approaches 8 for COPY and SCALE.
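
For reading the tables below: the Assignment, Scaling, Summing and
SAXPYing rows correspond to the COPY, SCALE, ADD and TRIAD kernels, and
each reported rate is consistent with bytes moved divided by the best
(minimum) time. The following is a minimal Python sketch of that
arithmetic, not part of the benchmark itself. It assumes the usual STREAM
byte counts (16 bytes per iteration for COPY and SCALE, 24 for ADD and
TRIAD, with 8-byte words) and 1 MB = 10^6 bytes; the function name
stream_rate_mb_s is purely illustrative.

```python
# Bytes moved per loop iteration, assuming 8-byte DOUBLE PRECISION words
# (which is what the benchmark output itself reports).
BYTES_PER_ITER = {
    "Assignment": 16,  # COPY:  one load + one store
    "Scaling": 16,     # SCALE: one load + one store
    "Summing": 24,     # ADD:   two loads + one store
    "SAXPYing": 24,    # TRIAD: two loads + one store
}


def stream_rate_mb_s(kernel, array_size, min_time_s):
    """Rate in MB/s (1 MB = 1e6 bytes) derived from the minimum time."""
    return BYTES_PER_ITER[kernel] * array_size / min_time_s / 1.0e6


# Serial stream_d.O2 run: array size 2000000, minimum Assignment time 0.41 s.
print(round(stream_rate_mb_s("Assignment", 2_000_000, 0.41), 4))
```

With the serial stream_d.O2 values this reproduces the 78.0488 MB/s
reported for Assignment. Because only the minimum time enters the rate,
a single unusually fast iteration is enough to raise the reported figure.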

stream_d.O2
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 2000000
 Offset = 0
 The total memory requirement is 45 MB
 You are running each test 10 times
 The *best* time for each test is used
 ----------------------------------------------------
 Your clock granularity/precision appears to be 10000 microseconds
 The tests below will each take a time on the order
 of 450000 microseconds
    (= 45 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 ----------------------------------------------------
 WARNING: The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 ----------------------------------------------------
Function Rate (MB/s) RMS time Min time Max time
Assignment: 78.0488 .4122 .4100 .4200
Scaling : 76.1905 .4211 .4200 .4300
Summing : 88.8889 .5489 .5400 .5500
SAXPYing : 94.1176 .5223 .5100 .5400
 Sum of a is : 2.306601562591874E+18
 Sum of b is : 4.613203124856438E+17
 Sum of c is : 6.150937500141255E+17
19.9u 2.7s 0:30 75%
stream_d.O4
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 2000000
 Offset = 0
 The total memory requirement is 45 MB
 You are running each test 10 times
 The *best* time for each test is used
 ----------------------------------------------------
 Your clock granularity/precision appears to be 10000 microseconds
 The tests below will each take a time on the order
 of 430000 microseconds
    (= 43 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 ----------------------------------------------------
 WARNING: The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 ----------------------------------------------------
Function Rate (MB/s) RMS time Min time Max time
Assignment: 80.0000 .4056 .4000 .4200
Scaling : 80.0000 .4112 .4000 .4200
Summing : 96.0000 .5078 .5000 .5200
SAXPYing : 94.1176 .5134 .5100 .5200
 Sum of a is : 2.306601562560083E+18
 Sum of b is : 4.613203125073140E+17
 Sum of c is : 6.150937500122511E+17
19.5u 2.7s 0:29 76%
mpa -DATA par_stream_d.O2 1
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 18 microseconds.
Each test below will take on the order of 6047973 microseconds.
   (= 335998 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 1 threads.
Function Rate (MB/s) RMS time Min time Max time
Assignment: 76.3495 5.2499 5.2391 5.3038
Scaling : 74.6323 5.3628 5.3596 5.3655
Summing : 93.1901 6.4394 6.4384 6.4409
SAXPYing : 93.0454 6.4531 6.4485 6.4645
mpa -DATA par_stream_d.O2 2
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 18 microseconds.
Each test below will take on the order of 6049826 microseconds.
   (= 336101 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 2 threads.
Function Rate (MB/s) RMS time Min time Max time
Assignment: 150.1820 2.6706 2.6634 2.6765
Scaling : 146.6131 2.7316 2.7283 2.7396
Summing : 179.5049 3.3565 3.3425 3.4011
SAXPYing : 178.9275 3.3609 3.3533 3.3690
mpa -DATA par_stream_d.O2 3
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 18 microseconds.
Each test below will take on the order of 6082363 microseconds.
   (= 337909 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 3 threads.
Function Rate (MB/s) RMS time Min time Max time
Assignment: 207.2189 1.9343 1.9303 1.9406
Scaling : 204.4633 1.9607 1.9563 1.9665
Summing : 237.7136 2.5287 2.5240 2.5346
SAXPYing : 235.6349 2.5513 2.5463 2.5555
mpa -DATA par_stream_d.O2 4
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 18 microseconds.
Each test below will take on the order of 6062518 microseconds.
   (= 336806 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 4 threads.
Function Rate (MB/s) RMS time Min time Max time
Assignment: 262.4959 1.5268 1.5238 1.5325
Scaling : 256.7633 1.5633 1.5579 1.5853
Summing : 298.1421 2.0174 2.0125 2.0219
SAXPYing : 297.0682 2.0229 2.0197 2.0293
mpa -DATA par_stream_d.O2 5
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 18 microseconds.
Each test below will take on the order of 6047701 microseconds.
   (= 335983 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 5 threads.
Function Rate (MB/s) RMS time Min time Max time
Assignment: 309.2101 1.2976 1.2936 1.3095
Scaling : 301.8488 1.3265 1.3252 1.3275
Summing : 350.2363 1.7166 1.7131 1.7203
SAXPYing : 349.7680 1.7186 1.7154 1.7217
mpa -DATA par_stream_d.O2 6
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 18 microseconds.
Each test below will take on the order of 6047389 microseconds.
   (= 335966 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 6 threads.
Function Rate (MB/s) RMS time Min time Max time
Assignment: 354.5204 1.1299 1.1283 1.1309
Scaling : 346.0609 1.1579 1.1559 1.1612
Summing : 397.7894 1.5116 1.5083 1.5146
SAXPYing : 395.9419 1.5175 1.5154 1.5205
mpa -DATA par_stream_d.O2 7
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 18 microseconds.
Each test below will take on the order of 6044445 microseconds.
   (= 335802 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 7 threads.
Function Rate (MB/s) RMS time Min time Max time
Assignment: 392.5795 1.0205 1.0189 1.0236
Scaling : 383.9788 1.0439 1.0417 1.0484
Summing : 434.7946 1.3826 1.3800 1.3892
SAXPYing : 433.8762 1.3850 1.3829 1.3882
mpa -DATA par_stream_d.O2 8
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 18 microseconds.
Each test below will take on the order of 6042724 microseconds.
   (= 335706 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 8 threads.
Function Rate (MB/s) RMS time Min time Max time
Assignment: 465.3760 0.9393 0.8595 0.9619
Scaling : 417.6547 0.9611 0.9577 0.9747
Summing : 468.5084 1.2847 1.2807 1.2987
SAXPYing : 466.6072 1.2892 1.2859 1.2974
mpa -DATA par_stream_d.O4 1
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 18 microseconds.
Each test below will take on the order of 6049023 microseconds.
   (= 336056 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 1 threads.
Function Rate (MB/s) RMS time Min time Max time
Assignment: 76.3216 5.2494 5.2410 5.3079
Scaling : 74.2698 5.3953 5.3858 5.3974
Summing : 92.5786 6.4878 6.4810 6.5279
SAXPYing : 93.0240 6.4567 6.4500 6.4870
mpa -DATA par_stream_d.O4 2
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 18 microseconds.
Each test below will take on the order of 6048816 microseconds.
   (= 336045 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 2 threads.
Function Rate (MB/s) RMS time Min time Max time
Assignment: 143.2986 2.8006 2.7914 2.8309
Scaling : 140.3433 2.8624 2.8502 2.8906
Summing : 165.8798 3.6332 3.6171 3.7076
SAXPYing : 165.8129 3.6284 3.6185 3.6511
mpa -DATA par_stream_d.O4 3
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 18 microseconds.
Each test below will take on the order of 6051061 microseconds.
   (= 336170 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 3 threads.
Function Rate (MB/s) RMS time Min time Max time
Assignment: 215.0631 1.8631 1.8599 1.8709
Scaling : 208.5560 1.9227 1.9180 1.9301
Summing : 248.1001 2.4250 2.4184 2.4323
SAXPYing : 249.5010 2.4092 2.4048 2.4145
mpa -DATA par_stream_d.O4 4
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 18 microseconds.
Each test below will take on the order of 6042531 microseconds.
   (= 335696 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 4 threads.
Function Rate (MB/s) RMS time Min time Max time
Assignment: 263.8708 1.5195 1.5159 1.5242
Scaling : 257.5287 1.5556 1.5532 1.5605
Summing : 296.3267 2.0290 2.0248 2.0338
SAXPYing : 297.6323 2.0195 2.0159 2.0258
mpa -DATA par_stream_d.O4 5
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 18 microseconds.
Each test below will take on the order of 6049244 microseconds.
   (= 336069 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 5 threads.
Function Rate (MB/s) RMS time Min time Max time
Assignment: 311.1584 1.2888 1.2855 1.2928
Scaling : 303.0067 1.3241 1.3201 1.3370
Summing : 347.4138 1.7294 1.7270 1.7334
SAXPYing : 348.8248 1.7221 1.7201 1.7254
mpa -DATA par_stream_d.O4 6
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 18 microseconds.
Each test below will take on the order of 6051047 microseconds.
   (= 336169 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 6 threads.
Function Rate (MB/s) RMS time Min time Max time
Assignment: 354.5166 1.1310 1.1283 1.1353
Scaling : 345.7863 1.1590 1.1568 1.1612
Summing : 395.7076 1.5169 1.5163 1.5182
SAXPYing : 396.3337 1.5175 1.5139 1.5207
mpa -DATA par_stream_d.O4 7
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 18 microseconds.
Each test below will take on the order of 6049326 microseconds.
   (= 336073 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 7 threads.
Function Rate (MB/s) RMS time Min time Max time
Assignment: 394.8512 1.0146 1.0130 1.0181
Scaling : 385.5362 1.0401 1.0375 1.0427
Summing : 433.6404 1.3846 1.3836 1.3874
SAXPYing : 434.7171 1.3820 1.3802 1.3852
mpa -DATA par_stream_d.O4 8
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 18 microseconds.
Each test below will take on the order of 6057808 microseconds.
   (= 336544 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 8 threads.
Function Rate (MB/s) RMS time Min time Max time
Assignment: 425.5767 0.9430 0.9399 0.9540
Scaling : 418.1402 0.9596 0.9566 0.9720
Summing : 469.0193 1.2842 1.2793 1.2920
SAXPYing : 468.6036 1.2843 1.2804 1.2928
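
To put the parallel runs above in terms of scaling, the rates can be
turned into speedup and parallel efficiency relative to the 1-thread
run. This is only a rough sketch in Python: the list holds the
Assignment (COPY) rates copied from the par_stream_d.O4 output above,
and the helper name scaling is made up for illustration.

```python
# Assignment (COPY) rates in MB/s for par_stream_d.O4 on 1..8 threads,
# copied from the benchmark output above.
copy_rates_o4 = [76.3216, 143.2986, 215.0631, 263.8708,
                 311.1584, 354.5166, 394.8512, 425.5767]


def scaling(rates):
    """Return (threads, speedup, efficiency) relative to the 1-thread rate."""
    base = rates[0]
    return [(n, r / base, r / base / n) for n, r in enumerate(rates, start=1)]


for threads, speedup, efficiency in scaling(copy_rates_o4):
    print(f"{threads} threads: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

On these numbers the 8-thread COPY speedup comes out at roughly 5.6,
i.e. about 70% efficiency. The same helper can be applied to the O2
rates, with one caveat: the O2 8-thread Assignment rate (465.3760 MB/s)
comes from a minimum time of 0.8595 s that is well below that run's RMS
time of 0.9393 s, so it may overstate the typical rate.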

It is also interesting that the aggregate system time for this
benchmarking run is 11% of the user time.

Queue : dedicated_8 Host : lena
Sequence : 14643 Remote Host : lena.ncsa.uiuc.edu
Submitted : Mon May 20 02:08:17 1996
Started : Mon May 20 06:34:11 1996
Completed : Mon May 20 07:12:00 1996
User : 4863.580000 secs System : 552.150000 secs
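
As a quick check of the 11% figure, here is a minimal Python sketch
using only the User and System times from the NQS accounting lines
above (the variable names are illustrative):

```python
# CPU times reported by NQS for the dedicated_8 benchmarking job.
user_s = 4863.58
system_s = 552.15

# System time expressed as a fraction of user time.
ratio = system_s / user_s
print(f"system/user = {ratio:.1%}")
```

The ratio works out to a little over 11%.
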
Constantinos Evangelinos
Center for Fluid Mechanics
Brown University/Division of Applied Mathematics


