V2600 STREAM Results, single cpu boards

From: Kirby Collins (kcollins@rsn.hp.com)
Date: Tue Feb 01 2000 - 16:36:05 CST


Attached are STREAM results for a single node V2600,
with single (one cpu per runway) boards installed. In
this configuration a V2600 node supports up to 16
552MHz PA8600 processors (this is commonly referred to
and sold as a "technical" configuration).

The system had the following OS & compiler revs installed:

  B3901BA B.11.01.08 HP C/ANSI C Developer's Bundle for HP-UX 11.00 (S800)
  B3909CA B.11.01.06 HP FORTRAN Compiler and associated products (S800)
  B3909DB B.11.01.08 HP Fortran 90 Compiler and associated products (S800)
  XSWGR1100 B.11.00.47 General Release Patches, November 1999 (ACE)
  PHSS_18300 1.0 ANSI C compiler cumulative patch.
  PHSS_20206 1.0 Cumulative Fortran90 v2.3 Patch 1

I modified the Fortran source to increase the size of the arrays, and to compile in
64-bit (wide) mode (note that HP systems use the LP64 model):

rodgers [104] diff stream_d.f stream_d.f.orig
53c53
<       PARAMETER (n=20000000,offset=0,ndim=n+offset,ntimes=10)
---
>       PARAMETER (n=2000000,offset=0,ndim=n+offset,ntimes=10)
57,60c57
< CKLC changes for 64-bit
< CKLC      INTEGER j,k,nbpw,quantum
<       INTEGER j,k,quantum
<       INTEGER*8 nbpw
---
>       INTEGER j,k,nbpw,quantum
65c62
<       INTEGER*8 bytes(4)
---
>       INTEGER bytes(4)
81c78
<       COMMON a,b,c
---
> *     COMMON a,b,c
338a336
> 

f90 -o stream_d.mp +Oinfo +DA2.0W +DS2.0 +noppu +O3 +Odataprefetch +FPD -Wl,+pd,64M +Oparallel stream_d.f second_wall.o

Except for the one-way run, I deconfigured cpu boards via firmware to the
desired configuration and rebooted for each test. The one-way run was
performed with the full complement of 16 cpus configured, but with the
parallelism restricted by setting the MP_NUMBER_OF_THREADS environment
variable to 1.

==============================================================
ncpus=1 (16 cpus configured)
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 20000000
 Offset     = 0
 The total memory requirement is 457 MB
 You are running each test 10 times
 The *best* time for each test is used
----------------------------------------------------
 Your clock granularity/precision appears to be 3 microseconds
 The tests below will each take a time on the order of 525107 microseconds
    (= 175036 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
----------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
----------------------------------------------------
Function     Rate (MB/s)   RMS time   Min time   Max time
Copy:          390.3334     0.9016     0.8198     0.9106
Scale:         389.9001     0.8548     0.8207     0.8591
Add:           370.2372     1.2978     1.2965     1.2992
Triad:         384.1730     1.2505     1.2494     1.2515
 Sum of a is = 2.306601562980265E+19
 Sum of b is = 4.613203126278327E+18
 Sum of c is = 6.150937498077968E+18
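The figures in the ncpus=1 run above can be cross-checked from the array size and the minimum times. The sketch below is not part of the original post. It assumes the usual STREAM byte-counting convention (Copy and Scale touch two arrays per iteration, Add and Triad touch three), a rate unit of 10^6 bytes per second, and a memory figure expressed in units of 2^20 bytes. The names `memory_mb` and `rate_mb_s` are illustrative only.

```python
# Cross-check of the ncpus=1 STREAM output (values copied from the run above).
N = 20_000_000          # array size from the modified PARAMETER line
BYTES_PER_WORD = 8      # "Assuming 8 bytes per DOUBLE PRECISION word"

# Assumed STREAM convention: number of arrays read or written per iteration.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# Minimum times in seconds, as reported for ncpus=1.
MIN_TIME = {"Copy": 0.8198, "Scale": 0.8207, "Add": 1.2965, "Triad": 1.2494}


def memory_mb(n=N, arrays=3):
    """Footprint of the three arrays in units of 2**20 bytes (assumed unit)."""
    return arrays * n * BYTES_PER_WORD / 2**20


def rate_mb_s(kernel):
    """Bandwidth in 10**6 bytes per second, computed from the best time."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * N * BYTES_PER_WORD
    return bytes_moved / MIN_TIME[kernel] / 1e6


if __name__ == "__main__":
    print(f"memory: {memory_mb():.1f} MB")
    for kernel in MIN_TIME:
        print(f"{kernel:6s} {rate_mb_s(kernel):10.1f} MB/s")
```

Under these assumptions the computed rates agree with the reported ones to within rounding of the four-digit minimum times, and the footprint truncates to the reported 457 MB.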

==============================================================
ncpus=4 (4 cpus configured)
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 20000000
 Offset     = 0
 The total memory requirement is 457 MB
 You are running each test 10 times
 The *best* time for each test is used
----------------------------------------------------
 Your clock granularity/precision appears to be 3 microseconds
 The tests below will each take a time on the order of 131203 microseconds
    (= 43734 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
----------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
----------------------------------------------------
Function     Rate (MB/s)   RMS time   Min time   Max time
Copy:         1358.5052     0.2372     0.2356     0.2473
Scale:        1458.0587     0.2287     0.2195     0.2916
Add:          1438.0558     0.3502     0.3338     0.4645
Triad:        1502.8371     0.3220     0.3194     0.3229
 Sum of a is = 2.306601562980265E+19
 Sum of b is = 4.613203126278327E+18
 Sum of c is = 6.150937498077968E+18

=============================================================
ncpus=8 (8 cpus configured)
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 20000000
 Offset     = 0
 The total memory requirement is 457 MB
 You are running each test 10 times
 The *best* time for each test is used
----------------------------------------------------
 Your clock granularity/precision appears to be 2 microseconds
 The tests below will each take a time on the order of 66005 microseconds
    (= 33002 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
----------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
----------------------------------------------------
Function     Rate (MB/s)   RMS time   Min time   Max time
Copy:         2977.6831     0.1411     0.1075     0.2352
Scale:        3004.9765     0.1198     0.1065     0.1253
Add:          2583.4783     0.1880     0.1858     0.1964
Triad:        2653.1797     0.1821     0.1809     0.1897
 Sum of a is = 2.306601562980265E+19
 Sum of b is = 4.613203126278327E+18
 Sum of c is = 6.150937498077968E+18

=============================================================
ncpus=16 (16 cpus configured)
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 20000000
 Offset     = 0
 The total memory requirement is 457 MB
 You are running each test 10 times
 The *best* time for each test is used
----------------------------------------------------
 Your clock granularity/precision appears to be 2 microseconds
 The tests below will each take a time on the order of 54744 microseconds
    (= 27372 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
----------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
----------------------------------------------------
Function     Rate (MB/s)   RMS time   Min time   Max time
Copy:         4093.6950     0.0913     0.0782     0.0929
Scale:        4209.3656     0.0894     0.0760     0.0911
Add:          3663.0876     0.1317     0.1310     0.1326
Triad:        3706.7641     0.1303     0.1295     0.1353
 Sum of a is = 2.306601562980265E+19
 Sum of b is = 4.613203126278327E+18
 Sum of c is = 6.150937498077968E+18
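To compare the four runs at a glance, the Triad rates above can be turned into speedup and parallel efficiency relative to the one-way run. This is a rough sketch added for illustration, not something from the original measurements. The helper name `scaling` is hypothetical, and the baseline is the ncpus=1 run, which was taken with all 16 cpus configured, so it may not be strictly comparable to a true single-cpu configuration.

```python
# Triad rates in MB/s, copied from the four runs above, keyed by CPU count.
TRIAD_MB_S = {1: 384.1730, 4: 1502.8371, 8: 2653.1797, 16: 3706.7641}


def scaling(rates):
    """Return {ncpus: (speedup, efficiency)} relative to the 1-CPU rate."""
    base = rates[1]
    return {p: (r / base, r / base / p) for p, r in sorted(rates.items())}


if __name__ == "__main__":
    for ncpus, (speedup, eff) in scaling(TRIAD_MB_S).items():
        print(f"{ncpus:2d} cpus: speedup {speedup:5.2f}, efficiency {eff:4.0%}")
```

With these inputs the Triad speedup comes out at roughly 3.9x on 4 cpus, 6.9x on 8, and 9.6x on 16, so efficiency falls off as more cpus share the memory system.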


