V2600 STREAM Results, dual cpu boards

From: Kirby Collins (kcollins@rsn.hp.com)
Date: Tue Feb 01 2000 - 16:29:47 CST


Attached are STREAM results for a single-node V2600
with dual (two CPUs per runway) boards installed. In
this configuration a V2600 node supports up to 32
PA8600 processors running at 552MHz.

The system had the following OS & compiler revs installed:

  B3901BA B.11.01.08 HP C/ANSI C Developer's Bundle for HP-UX 11.00 (S800)
  B3909CA B.11.01.06 HP FORTRAN Compiler and associated products (S800)
  B3909DB B.11.01.08 HP Fortran 90 Compiler and associated products (S800)
  XSWGR1100 B.11.00.47 General Release Patches, November 1999 (ACE)
  PHSS_18300 1.0 ANSI C compiler cumulative patch.
  PHSS_20206 1.0 Cumulative Fortran90 v2.3 Patch 1

I modified the Fortran source to increase the size of the arrays and to compile in
64-bit (wide) mode (note that HP systems use the LP64 model):

rodgers [104] diff stream_d.f stream_d.f.orig
53c53
<       PARAMETER (n=20000000,offset=0,ndim=n+offset,ntimes=10)
---
>       PARAMETER (n=2000000,offset=0,ndim=n+offset,ntimes=10)
57,60c57
< CKLC changes for 64-bit
< CKLC      INTEGER j,k,nbpw,quantum
<       INTEGER j,k,quantum
<       INTEGER*8 nbpw
---
>       INTEGER j,k,nbpw,quantum
65c62
<       INTEGER*8 bytes(4)
---
>       INTEGER bytes(4)
81c78
<       COMMON a,b,c
---
> *     COMMON a,b,c
338a336
> 
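
As a rough check on what the larger n means for memory, here is a minimal sketch of the footprint of the three arrays (a, b, c). It assumes 8 bytes per DOUBLE PRECISION word and that "MB" means 2**20 bytes with truncation, which is consistent with the 457 MB figure the benchmark prints; the function name is illustrative only.

```python
# Footprint of the three STREAM arrays (a, b, c) for a given array length n.
BYTES_PER_WORD = 8  # DOUBLE PRECISION word size reported by the benchmark


def stream_memory_mb(n, arrays=3, bytes_per_word=BYTES_PER_WORD):
    """Return total array storage in MB (2**20 bytes), truncated to an integer."""
    return (arrays * n * bytes_per_word) // 2**20


original = stream_memory_mb(2_000_000)    # n in stream_d.f.orig
enlarged = stream_memory_mb(20_000_000)   # n in the modified stream_d.f
print(f"original n: {original} MB, enlarged n: {enlarged} MB")
```

Under these assumptions the tenfold increase in n takes the arrays from about 45 MB to about 457 MB.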

f90 -o stream_d.mp +Oinfo +DA2.0W +DS2.0 +noppu +O3 +Odataprefetch +FPD -Wl,+pd,64M +Oparallel stream_d.f second_wall.o

Except for the 1-way run, I deconfigured CPU boards via firmware to the desired configuration and rebooted for each test. The 1-way run was performed with the full complement of 32 CPUs configured, but with the parallelism restricted by setting the MP_NUMBER_OF_THREADS environment variable to 1.
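
For anyone scripting the restricted run, the following is a small sketch of launching the benchmark with the thread count capped through the environment. The binary name ./stream_d.mp comes from the compile line above; the helper names stream_env and run_stream are hypothetical, and the wrapper is not how the runs reported here were actually launched.

```python
import os
import subprocess


def stream_env(nthreads, base_env=None):
    """Return a copy of the environment with MP_NUMBER_OF_THREADS set."""
    env = dict(os.environ if base_env is None else base_env)
    env["MP_NUMBER_OF_THREADS"] = str(nthreads)
    return env


def run_stream(binary="./stream_d.mp", nthreads=1):
    """Run the benchmark binary with the given thread cap (hypothetical wrapper)."""
    return subprocess.run(
        [binary],
        env=stream_env(nthreads),
        capture_output=True,
        text=True,
        check=True,
    )
```

Calling run_stream(nthreads=1) would correspond to the 1-way run; the other runs were sized by deconfiguring boards, not by this variable.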

==============================================================
ncpus=1 (32 cpus configured)

----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 20000000
 Offset     = 0
 The total memory requirement is 457 MB
 You are running each test 10 times
 The *best* time for each test is used
----------------------------------------------------
 Your clock granularity/precision appears to be 3 microseconds
 The tests below will each take a time on the order
 of 525956 microseconds
    (= 175319 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
----------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
----------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Copy:           390.0983     0.9007     0.8203     0.9094
Scale:          389.5356     0.8784     0.8215     0.8848
Add:            371.6430     1.2932     1.2916     1.2950
Triad:          376.3773     1.2762     1.2753     1.2772
 Sum of a is = 2.306601562980265E+19
 Sum of b is = 4.613203126278327E+18
 Sum of c is = 6.150937498077968E+18
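
The reported rates can be cross-checked against the minimum times. The sketch below assumes the usual STREAM byte counting (Copy and Scale touch two arrays per iteration, Add and Triad touch three) and MB = 10**6 bytes; it uses the 1-CPU minimum times from the run above. Small differences from the printed rates are expected because the printed times are rounded to four decimals.

```python
N = 20_000_000        # array length used in these runs
BYTES_PER_WORD = 8    # DOUBLE PRECISION

# Arrays read or written per loop iteration (standard STREAM counting, assumed here)
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def rate_mb_s(kernel, min_time_s, n=N):
    """Bandwidth in MB/s (10**6 bytes) from the best time for one kernel."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_WORD * n
    return bytes_moved / min_time_s / 1e6


# Minimum times (seconds) from the ncpus=1 run
min_times = {"Copy": 0.8203, "Scale": 0.8215, "Add": 1.2916, "Triad": 1.2753}
rates = {kernel: rate_mb_s(kernel, t) for kernel, t in min_times.items()}

for kernel, rate in rates.items():
    print(f"{kernel:6s} {rate:10.2f} MB/s")
```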

==============================================================
ncpus=8 (8 cpus configured)

----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 20000000
 Offset     = 0
 The total memory requirement is 457 MB
 You are running each test 10 times
 The *best* time for each test is used
----------------------------------------------------
 Your clock granularity/precision appears to be 3 microseconds
 The tests below will each take a time on the order
 of 132629 microseconds
    (= 44210 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
----------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
----------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Copy:          1539.8016     0.2210     0.2078     0.2546
Scale:         1547.6878     0.2187     0.2068     0.2382
Add:           1645.0469     0.2935     0.2918     0.2981
Triad:         1621.4903     0.2965     0.2960     0.2975
 Sum of a is = 2.306601562980265E+19
 Sum of b is = 4.613203126278327E+18
 Sum of c is = 6.150937498077968E+18

==============================================================
ncpus=16 (16 cpus configured)

----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 20000000
 Offset     = 0
 The total memory requirement is 457 MB
 You are running each test 10 times
 The *best* time for each test is used
----------------------------------------------------
 Your clock granularity/precision appears to be 2 microseconds
 The tests below will each take a time on the order
 of 66981 microseconds
    (= 33490 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
----------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
----------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Copy:          3056.2062     0.1134     0.1047     0.1552
Scale:         3081.3990     0.1131     0.1038     0.1493
Add:           3209.9086     0.1505     0.1495     0.1554
Triad:         3193.9745     0.1508     0.1503     0.1518
 Sum of a is = 2.306601562980265E+19
 Sum of b is = 4.613203126278327E+18
 Sum of c is = 6.150937498077968E+18

==============================================================
ncpus=24 (24 cpus configured)

----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 20000000
 Offset     = 0
 The total memory requirement is 457 MB
 You are running each test 10 times
 The *best* time for each test is used
----------------------------------------------------
 Your clock granularity/precision appears to be 2 microseconds
 The tests below will each take a time on the order
 of 62135 microseconds
    (= 31067 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
----------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
----------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Copy:          3741.4224     0.1075     0.0855     0.1107
Scale:         3946.3325     0.1032     0.0811     0.1060
Add:           3304.5341     0.1467     0.1453     0.1498
Triad:         3283.5793     0.1476     0.1462     0.1498
 Sum of a is = 2.306601562980265E+19
 Sum of b is = 4.613203126278327E+18
 Sum of c is = 6.150937498077968E+18

==============================================================
ncpus=32 (32 cpus configured)

----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 20000000
 Offset     = 0
 The total memory requirement is 457 MB
 You are running each test 10 times
 The *best* time for each test is used
----------------------------------------------------
 Your clock granularity/precision appears to be 2 microseconds
 The tests below will each take a time on the order
 of 58336 microseconds
    (= 29168 clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
----------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
----------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Copy:          3993.8620     0.1034     0.0801     0.1118
Scale:         4139.0222     0.1007     0.0773     0.1039
Add:           3543.9800     0.1432     0.1354     0.1872
Triad:         3898.2521     0.1312     0.1231     0.1811
 Sum of a is = 2.306601562980265E+19
 Sum of b is = 4.613203126278327E+18
 Sum of c is = 6.150937498077968E+18
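
To compare the five runs at a glance, here is a short sketch that turns the Triad rates reported above into speedup and per-CPU efficiency relative to the 1-CPU run. It is only an illustrative summary: the 1-CPU baseline was measured with all 32 CPUs configured, so it is not the same hardware configuration as the other runs.

```python
# Triad rates (MB/s) from the runs in this post, keyed by number of CPUs
triad = {1: 376.3773, 8: 1621.4903, 16: 3193.9745, 24: 3283.5793, 32: 3898.2521}

base = triad[1]
speedup = {n: rate / base for n, rate in triad.items()}
efficiency = {n: speedup[n] / n for n in triad}

print("ncpus  Triad MB/s  speedup  efficiency")
for n in sorted(triad):
    print(f"{n:5d}  {triad[n]:10.1f}  {speedup[n]:7.2f}  {efficiency[n]:10.2f}")
```

By this measure Triad bandwidth at 32 CPUs is roughly ten times the 1-CPU figure, with most of the gain already reached at 16 CPUs.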


