STREAM results for Ultra60/360 (2 cpus)

From: ce107@cfm.brown.edu
Date: Tue Oct 20 1998 - 01:23:40 CDT


These runs use etime() for timing and the Fortran source, built with
version 4.2 of the Sun compilers. (The 5.0 compilers implement the
prefetch instructions for Ultra2, so better performance is to be
expected from them.)

For the serial code I compiled using:
f77 -fast -xO4 -xdepend -fsimple=2 -xunroll=8 -xarch=v8plusa \
-o stream.shiva stream_d.f
shiva:ce107/benchmark/stream% time stream.shiva
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 2000000
 Offset = 0
 The total memory requirement is 45 MB
 You are running each test 10 times
 The *best* time for each test is used
 ----------------------------------------------------
 Your clock granularity/precision appears to be 11 microseconds
 The tests below will each take a time on the order
 of 74281 microseconds
    (= 6753 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 ----------------------------------------------------
 WARNING: The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 ----------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:      355.1692     0.0906     0.0901     0.0919
Scaling   :      343.8042     0.0934     0.0931     0.0937
Summing   :      311.1300     0.1540     0.1543     0.1551
SAXPYing  :      358.5744     0.1345     0.1339     0.1355
 Sum of a is : 2.3066015625919D+18
 Sum of b is : 4.6132031248564D+17
 Sum of c is : 6.1509375001413D+17
 Note: Nonstandard floating-point mode enabled
 See the Numerical Computation Guide, ieee_sun(3M)
5.25u 0.22s 0:05.53 98.9%
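
The rates in the table can be cross-checked from the minimum times. The
sketch below assumes the usual STREAM byte counting (two 8-byte words per
element for Assignment and Scaling, three for Summing and SAXPYing) and
decimal megabytes. The helper name stream_rate_mb_s is made up for
illustration and is not part of the benchmark.

```python
# Cross-check of the serial STREAM rates from the reported minimum times.
# Assumption: rate = bytes moved / best time, with 1 MB = 10**6 bytes.
N = 2_000_000          # array size from the run above
WORD_BYTES = 8         # bytes per DOUBLE PRECISION word

# Arrays touched per loop iteration (reads plus writes); assumed counting.
ARRAYS_TOUCHED = {"Assignment": 2, "Scaling": 2, "Summing": 3, "SAXPYing": 3}

# Minimum times in seconds, copied from the serial table.
MIN_TIME = {"Assignment": 0.0901, "Scaling": 0.0931,
            "Summing": 0.1543, "SAXPYing": 0.1339}


def stream_rate_mb_s(kernel, min_time):
    """Return the bandwidth in MB/s for one kernel (hypothetical helper)."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * WORD_BYTES * N
    return bytes_moved / min_time / 1e6


for kernel, t in MIN_TIME.items():
    print(f"{kernel:<11s} {stream_rate_mb_s(kernel, t):8.1f} MB/s")
```

The recomputed values differ from the printed rates only by the rounding
of the times to four decimal places.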

For the autoparallelized code I compiled using:
f77 -autopar -stackvar -fast -xO4 -xdepend -fsimple=2 -xunroll=8 \
-xarch=v8plusa -o stream.shiva.p stream_d.f
I needed to increase the stack limit accordingly.
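
A rough size estimate shows why the stack limit matters here. The sketch
assumes that -stackvar places all three work arrays on the stack, and the
8 MB default limit is only an assumed typical value, not one read from
this machine.

```python
# Estimate of the stack space needed for the three STREAM work arrays.
N = 2_000_000                   # array size reported by the benchmark
WORD_BYTES = 8                  # bytes per DOUBLE PRECISION word
NUM_ARRAYS = 3                  # a, b and c
ASSUMED_DEFAULT_STACK_MB = 8    # assumption, check with the shell's limit command


def array_footprint_bytes(n=N, arrays=NUM_ARRAYS, word=WORD_BYTES):
    """Total bytes occupied by the work arrays (hypothetical helper)."""
    return n * arrays * word


footprint = array_footprint_bytes()
footprint_mb = footprint // 2**20   # whole binary megabytes
print(f"arrays need about {footprint_mb} MB of stack")
print("exceeds assumed default:", footprint_mb > ASSUMED_DEFAULT_STACK_MB)
```

The whole-megabyte figure agrees with the "total memory requirement"
line that the benchmark prints.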

shiva:ce107/benchmark/stream% setenv PARALLEL 1
shiva:ce107/benchmark/stream% time stream.shiva.p
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 2000000
 Offset = 0
 The total memory requirement is 45 MB
 You are running each test 10 times
 The *best* time for each test is used
 ----------------------------------------------------
 Your clock granularity appears to be less than one microsecond
 Your clock granularity/precision appears to be 1 microseconds
 The tests below will each take a time on the order
 of 91866 microseconds
    (= 91866 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 ----------------------------------------------------
 WARNING: The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 ----------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:      297.1950     0.1290     0.1077     0.1523
Scaling   :      341.5131     0.1209     0.0937     0.1471
Summing   :      290.5452     0.2051     0.1652     0.2360
SAXPYing  :      271.9637     0.2047     0.1765     0.2272
 Sum of a is : 2.3066015625919D+18
 Sum of b is : 4.6132031248564D+17
 Sum of c is : 6.1509375001413D+17
 Note: Nonstandard floating-point mode enabled
 See the Numerical Computation Guide, ieee_sun(3M)
6.94u 0.22s 0:07.22 99.1%
shiva:ce107/benchmark/stream% setenv PARALLEL 2
shiva:ce107/benchmark/stream% time stream.shiva.p
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 2000000
 Offset = 0
 The total memory requirement is 45 MB
 You are running each test 10 times
 The *best* time for each test is used
 ----------------------------------------------------
 Your clock granularity appears to be less than one microsecond
 Your clock granularity/precision appears to be 1 microseconds
 The tests below will each take a time on the order
 of 94030 microseconds
    (= 94030 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 ----------------------------------------------------
 WARNING: The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 ----------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:      340.7895     0.0952     0.0939     0.0962
Scaling   :      392.5442     0.0822     0.0815     0.0830
Summing   :      371.5010     0.1396     0.1292     0.1443
SAXPYing  :      395.7602     0.1244     0.1213     0.1256
 Sum of a is : 2.3066015625919D+18
 Sum of b is : 4.6132031248564D+17
 Sum of c is : 6.1509375001413D+17
 Note: Nonstandard floating-point mode enabled
 See the Numerical Computation Guide, ieee_sun(3M)
9.73u 0.21s 0:05.20 191.1%
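
The percentage at the end of each "time" line is the CPU time (user plus
system) divided by the elapsed time, so a value near 200% indicates that
both processors were busy. The sketch below parses lines in the format
shown in this message. Truncating to one decimal place reproduces the
printed figures here, but that is an observation from these runs, not a
statement about how the shell computes them. The function name is
hypothetical.

```python
import re

# Matches lines such as "9.73u 0.21s 0:05.20 191.1%".
TIME_LINE = re.compile(r"(\d+)\.(\d\d)u (\d+)\.(\d\d)s (\d+):(\d\d)\.(\d\d)")


def cpu_utilization_tenths(line):
    """Return CPU utilization in tenths of a percent (hypothetical helper).

    Integer centiseconds are used throughout to avoid floating-point
    surprises when truncating.
    """
    m = TIME_LINE.match(line)
    if m is None:
        raise ValueError(f"unrecognized time line: {line!r}")
    us, uc, ss, sc, em, es, ec = (int(g) for g in m.groups())
    cpu_cs = (us * 100 + uc) + (ss * 100 + sc)      # user + system
    elapsed_cs = (em * 60 + es) * 100 + ec          # wall-clock time
    return cpu_cs * 1000 // elapsed_cs


for line in ("6.94u 0.22s 0:07.22 99.1%", "9.73u 0.21s 0:05.20 191.1%"):
    tenths = cpu_utilization_tenths(line)
    print(f"{line:<30s} -> {tenths // 10}.{tenths % 10}%")
```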

Compiling without the -stackvar option:
f77 -autopar -fast -xO4 -xdepend -fsimple=2 -xunroll=8 \
-xarch=v8plusa -o stream.shiva.pns stream_d.f

shiva:ce107/benchmark/stream% setenv PARALLEL 1
shiva:ce107/benchmark/stream% time stream.shiva.pns
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 2000000
 Offset = 0
 The total memory requirement is 45 MB
 You are running each test 10 times
 The *best* time for each test is used
 ----------------------------------------------------
 Your clock granularity appears to be less than one microsecond
 Your clock granularity/precision appears to be 1 microseconds
 The tests below will each take a time on the order
 of 91920 microseconds
    (= 91920 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 ----------------------------------------------------
 WARNING: The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 ----------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:      343.8879     0.0944     0.0931     0.0957
Scaling   :      287.9004     0.1116     0.1111     0.1123
Summing   :      352.7692     0.1380     0.1361     0.1388
SAXPYing  :      269.7542     0.1789     0.1779     0.1820
 Sum of a is : 2.3066015625919D+18
 Sum of b is : 4.6132031248564D+17
 Sum of c is : 6.1509375001413D+17
 Note: Nonstandard floating-point mode enabled
 See the Numerical Computation Guide, ieee_sun(3M)
5.76u 0.23s 0:06.06 98.8%
shiva:ce107/benchmark/stream% setenv PARALLEL 2
shiva:ce107/benchmark/stream% time stream.shiva.pns
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 2000000
 Offset = 0
 The total memory requirement is 45 MB
 You are running each test 10 times
 The *best* time for each test is used
 ----------------------------------------------------
 Your clock granularity appears to be less than one microsecond
 Your clock granularity/precision appears to be 1 microseconds
 The tests below will each take a time on the order
 of 58727 microseconds
    (= 58727 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 ----------------------------------------------------
 WARNING: The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 ----------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:      391.2734     0.0828     0.0818     0.0859
Scaling   :      324.4294     0.0989     0.0986     0.0994
Summing   :      428.6842     0.1123     0.1120     0.1130
SAXPYing  :      374.8837     0.1410     0.1280     0.1433
 Sum of a is : 2.3066015625919D+18
 Sum of b is : 4.6132031248564D+17
 Sum of c is : 6.1509375001413D+17
 Note: Nonstandard floating-point mode enabled
 See the Numerical Computation Guide, ieee_sun(3M)
9.39u 0.51s 0:05.03 196.8%

The combination of -autopar and -stackvar seems to affect some kernels
more than others. With -autopar alone, summation is even faster on a
single processor than in the serial code (352.8 vs. 311.1 MB/s).
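
To make the kernel-by-kernel differences easier to see, the sketch below
divides each autoparallelized rate by the corresponding serial rate. All
numbers are copied from the tables in this message; the run labels and
the helper name relative_to_serial are only illustrative.

```python
# Rates (MB/s) copied from the five Fortran 77 tables in this message.
KERNELS = ("Assignment", "Scaling", "Summing", "SAXPYing")
SERIAL = (355.1692, 343.8042, 311.1300, 358.5744)
PARALLEL_RUNS = {
    "-autopar -stackvar, PARALLEL=1": (297.1950, 341.5131, 290.5452, 271.9637),
    "-autopar -stackvar, PARALLEL=2": (340.7895, 392.5442, 371.5010, 395.7602),
    "-autopar only, PARALLEL=1": (343.8879, 287.9004, 352.7692, 269.7542),
    "-autopar only, PARALLEL=2": (391.2734, 324.4294, 428.6842, 374.8837),
}


def relative_to_serial(rates):
    """Map each kernel to its rate divided by the serial rate."""
    return {k: r / s for k, r, s in zip(KERNELS, rates, SERIAL)}


for label, rates in PARALLEL_RUNS.items():
    ratios = relative_to_serial(rates)
    cells = "  ".join(f"{k}={ratios[k]:.2f}" for k in KERNELS)
    print(f"{label:<32s} {cells}")
```

A ratio above 1.00 means the run beat the serial build for that kernel.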

For the F90 version of the code, again using etime():
f90 -fast -xO4 -xarch=v8plusa -o stream.shiva.90 stream90.f90

shiva:ce107/benchmark/stream% time stream.shiva.90
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 2000000
 Offset = 0
 The total memory requirement is 45 MB
 You are running each test 10 times
 The *best* time for each test is used
 ----------------------------------------------------
 Your clock granularity/precision appears to be 11 microseconds
 The tests below will each take a time on the order
 of 95057 microseconds
    (= 8642 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 ----------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 ----------------------------------------------------
Function   Rate (MB/s)   Min time   Max time   Mean time   RMS time   Median
Copy:           333.37     0.0960     0.0997      0.0971     0.0971   0.0964
Scale:          287.27     0.1114     0.1134      0.1121     0.1121   0.1126
Add:            333.87     0.1438     0.1453      0.1445     0.1445   0.1440
Triad:          241.46     0.1988     0.2032      0.2001     0.2001   0.1997
-------------------------------------------------------------------------------
 All times are
    0.0997 0.1124 0.1452 0.2021
    0.0968 0.1119 0.1443 0.2032
    0.0967 0.1124 0.1441 0.1988
    0.0965 0.1116 0.1441 0.1995
    0.0960 0.1134 0.1438 0.1992
    0.0969 0.1118 0.1443 0.2002
    0.0964 0.1114 0.1441 0.1989
    0.0980 0.1123 0.1453 0.1988
    0.0963 0.1119 0.1448 0.2015
    0.0975 0.1124 0.1453 0.1988
-------------------------------------------------------------------------------
 Sum of a is = 115330078125.
 Sum of b is = 23066015625.
 Sum of c is = 30754687500.
6.20u 0.32s 0:06.58 99.0%
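
The summary columns of the F90 table can be recomputed from the "All
times" listing. The sketch below assumes the four columns are Copy,
Scale, Add and Triad in that order, and that "RMS time" means the square
root of the mean squared time. The median is left out because this
message does not show how the program defines it. The helper name
column_stats is hypothetical.

```python
import math

KERNELS = ("Copy", "Scale", "Add", "Triad")   # assumed column order

# The ten timing rows copied from the "All times" listing (seconds).
ALL_TIMES = [
    (0.0997, 0.1124, 0.1452, 0.2021),
    (0.0968, 0.1119, 0.1443, 0.2032),
    (0.0967, 0.1124, 0.1441, 0.1988),
    (0.0965, 0.1116, 0.1441, 0.1995),
    (0.0960, 0.1134, 0.1438, 0.1992),
    (0.0969, 0.1118, 0.1443, 0.2002),
    (0.0964, 0.1114, 0.1441, 0.1989),
    (0.0980, 0.1123, 0.1453, 0.1988),
    (0.0963, 0.1119, 0.1448, 0.2015),
    (0.0975, 0.1124, 0.1453, 0.1988),
]


def column_stats(times):
    """Return min, max, mean and RMS of one kernel's timings."""
    n = len(times)
    mean = sum(times) / n
    rms = math.sqrt(sum(t * t for t in times) / n)
    return {"min": min(times), "max": max(times), "mean": mean, "rms": rms}


# zip(*ALL_TIMES) turns the rows into one tuple per kernel.
for name, column in zip(KERNELS, zip(*ALL_TIMES)):
    s = column_stats(column)
    print(f"{name:<6s} min={s['min']:.4f} max={s['max']:.4f} "
          f"mean={s['mean']:.4f} rms={s['rms']:.4f}")
```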

Compiling with -parallel (with or without -stackvar) resulted in code
that would not print anything apart from the first line of "---" :-).

Constantinos Evangelinos
Center for Fluid Mechanics
Brown University/Division of Applied Mathematics


