[stream] HP Integrity Superdome autoparallel Stream results

From: Kirby L. Collins <kcollins@rsn.hp.com>
Date: Tue Mar 30 2004 - 14:07:33 CST

Stream results for HP Integrity Superdome with 1500MHz,
6MB L3 cache Itanium2 processors:

16 cells, 64 1500MHz cpus, 256GB of memory (512x512MB DIMMs):

Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:      82276.4205      0.0499      0.0498      0.0500
Scale:     81269.1622      0.0508      0.0504      0.0527
Add:       83036.8965      0.0748      0.0740      0.0767
Triad:     84048.9669      0.0733      0.0731      0.0735

16 cells, 32 1500MHz cpus, 256GB of memory (512x512MB DIMMs):

Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:      81855.4008      0.0502      0.0500      0.0503
Scale:     80862.9293      0.0509      0.0507      0.0512
Add:       82734.0532      0.0753      0.0743      0.0764
Triad:     82352.5381      0.0750      0.0746      0.0758

8 cells, 32 1500MHz cpus, 128GB of memory (256x512MB DIMMs):

Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:      41381.6738      0.0991      0.0990      0.0992
Scale:     40900.3209      0.1005      0.1001      0.1012
Add:       41559.1577      0.1485      0.1478      0.1491
Triad:     42134.5581      0.1461      0.1458      0.1466

8 cells, 16 1500MHz cpus, 128GB of memory (256x512MB DIMMs):

Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:      40971.0368      0.1017      0.1000      0.1078
Scale:     40471.5630      0.1032      0.1012      0.1092
Add:       41149.4431      0.1530      0.1493      0.1626
Triad:     41142.8734      0.1509      0.1493      0.1560

4 cells, 16 1500MHz cpus, 64GB of memory (128x512MB DIMMs):

Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:      20733.0610      0.1984      0.1976      0.2004
Scale:     20480.8295      0.2007      0.2000      0.2021
Add:       20786.1156      0.2966      0.2956      0.2990
Triad:     21052.8563      0.2928      0.2918      0.2951
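
As a rough check on how these rates scale with cell count, the Triad
figures can be divided by the number of cells. The following is only a
sketch: the helper names rate_per_cell and print_triad_per_cell are made
up for illustration, and the input values are the Triad rates of the
fully populated 16, 8 and 4 cell runs copied from the tables above.

```c
#include <assert.h>
#include <stdio.h>

/* Bandwidth per cell in MB/s: total reported rate divided by cell count. */
double rate_per_cell(double total_mb_per_s, int cells)
{
    assert(cells > 0);
    return total_mb_per_s / cells;
}

/* Print the per-cell Triad rate for the fully populated configurations. */
void print_triad_per_cell(void)
{
    /* Triad rates (MB/s) and cell counts taken from the tables above. */
    const double triad[] = { 84048.9669, 42134.5581, 21052.8563 };
    const int cells[] = { 16, 8, 4 };

    for (int i = 0; i < 3; i++)
        printf("%2d cells: %.1f MB/s per cell\n",
               cells[i], rate_per_cell(triad[i], cells[i]));
}
```

All three configurations land between roughly 5253 and 5267 MB/s per
cell, so by this measure Triad bandwidth scales close to linearly with
the number of cells.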

The half-populated configurations (2 cpus per cell) are fully
supported and orderable from HP.

The system was booted with half of the memory in each cell
configured as local memory. The system was running HP-UX
11i v2 (11.23) TCOE. A patch (PHKL_30089) was installed which
improves performance of 64-way pthreaded applications by
approximately 10%.

The f90 version of the stream benchmark was compiled auto-parallel, with
the following changes (mysecond.c is a C routine that calls gettimeofday):

diff ../src/stream.ORIG/stream_d.f stream_d.f
63c63
<       PARAMETER (n=2000000,offset=0,ndim=n+offset,ntimes=10)
---
>       PARAMETER (n=256002800,offset=0,ndim=n+offset,ntimes=10)
72c72
<       INTEGER bytes(4)
---
>       INTEGER*8 bytes(4)
90c90
< *     COMMON a,b,c
---
>       COMMON a,b,c
compiled as follows:
        cc +DSitanium2 +DD64 +O3 +Odataprefetch -Wl,+pd,64M -c mysecond.c
        f90 -o stream_d.mp +Ofaster +DSitanium2 -Wl,+pd,1M +DD64 +Oautopar +Onoopenmp +autodbl4 +extend_source +noppu stream_d.f mysecond.o
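
The source of mysecond.c is not included in this post. A minimal sketch
of such a timer, assuming it simply returns the gettimeofday result as a
double in seconds, could look like the code below. The name mysecond
without a trailing underscore is also an assumption; it relies on +noppu
meaning that the Fortran compiler does not append an underscore to
external names.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in seconds since the epoch, returned as a double.
 * Assumed shape of the timer; the original mysecond.c is not shown. */
double mysecond(void)
{
    struct timeval tp;
    int rc = gettimeofday(&tp, NULL);

    assert(rc == 0);   /* gettimeofday returns 0 on success */
    (void)rc;          /* avoid an unused warning when NDEBUG is set */
    return (double)tp.tv_sec + (double)tp.tv_usec * 1.0e-6;
}
```

The benchmark would call this before and after each kernel and take the
difference, which matches the 1 microsecond clock granularity reported
in the outputs.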
By default, memory was allocated from local memory on a first-touch basis
(setting the page size hint to 1MB via the +pd 1M linker option produces 
a better match between the granularity of allocation and the chunk size 
each thread works on).  Here are the outputs for each configuration:
16 cells, 64 processors, 256 GB of memory (512x512MB DIMMs):
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size =  256002800
 Offset     =          0
 The total memory requirement is 5859 MB
 You are running each test  10 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------------
 Your clock granularity/precision appears to be      1 microseconds
 ----------------------------------------------------
Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:      82276.4205      0.0499      0.0498      0.0500
Scale:     81269.1622      0.0508      0.0504      0.0527
Add:       83036.8965      0.0748      0.0740      0.0767
Triad:     84048.9669      0.0733      0.0731      0.0735
 ----------------------------------------------------
 Solution Validates!
 ----------------------------------------------------
16 cells, 32 processors, 256 GB of memory (512x512MB DIMMs):
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size =  256002800
 Offset     =          0
 The total memory requirement is 5859 MB
 You are running each test  10 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------------
 Your clock granularity/precision appears to be      1 microseconds
 ----------------------------------------------------
Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:      81855.4008      0.0502      0.0500      0.0503  
Scale:     80862.9293      0.0509      0.0507      0.0512  
Add:       82734.0532      0.0753      0.0743      0.0764  
Triad:     82352.5381      0.0750      0.0746      0.0758  
 ----------------------------------------------------
 Solution Validates!
 ----------------------------------------------------
8 cells, 32 processors, 128 GB of memory (256x512MB DIMMs):
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size =  256002800
 Offset     =          0
 The total memory requirement is 5859 MB
 You are running each test  10 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------------
 Your clock granularity/precision appears to be      1 microseconds
 ----------------------------------------------------
Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:      41381.6738      0.0991      0.0990      0.0992
Scale:     40900.3209      0.1005      0.1001      0.1012
Add:       41559.1577      0.1485      0.1478      0.1491
Triad:     42134.5581      0.1461      0.1458      0.1466
 ----------------------------------------------------
 Solution Validates!
 ----------------------------------------------------
8 cells, 16 processors, 128 GB of memory (256x512MB DIMMs):
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size =  256002800
 Offset     =          0
 The total memory requirement is 5859 MB
 You are running each test  10 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------------
 Your clock granularity/precision appears to be      1 microseconds
 ----------------------------------------------------
Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:      40971.0368      0.1017      0.1000      0.1078
Scale:     40471.5630      0.1032      0.1012      0.1092
Add:       41149.4431      0.1530      0.1493      0.1626
Triad:     41142.8734      0.1509      0.1493      0.1560
 ----------------------------------------------------
 Solution Validates!
 ----------------------------------------------------
4 cells, 16 processors, 64 GB of memory (128x512MB DIMMs):
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size =  256002800
 Offset     =          0
 The total memory requirement is 5859 MB
 You are running each test  10 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------------
 Your clock granularity/precision appears to be      1 microseconds
 ----------------------------------------------------
Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:      20733.0610      0.1984      0.1976      0.2004
Scale:     20480.8295      0.2007      0.2000      0.2021
Add:       20786.1156      0.2966      0.2956      0.2990
Triad:     21052.8563      0.2928      0.2918      0.2951
 ----------------------------------------------------
 Solution Validates!
 ----------------------------------------------------
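
The reported figures can be cross-checked from the array size and the
minimum times. The sketch below assumes the usual STREAM accounting:
8 bytes per element, two arrays touched by Copy and Scale, three by Add
and Triad, rates in units of 10^6 bytes/s, and the memory requirement in
units of 2^20 bytes. The function names are made up for illustration.

```c
#include <assert.h>

/* Arrays touched per iteration under the assumed STREAM accounting. */
enum { COPY_ARRAYS = 2, SCALE_ARRAYS = 2, ADD_ARRAYS = 3, TRIAD_ARRAYS = 3 };

/* Rate in MB/s (1 MB = 10^6 bytes) for n 8-byte elements per array. */
double stream_rate_mb(long long n, int arrays, double min_time_s)
{
    assert(n > 0 && arrays > 0 && min_time_s > 0.0);
    double bytes = 8.0 * (double)arrays * (double)n;
    return 1.0e-6 * bytes / min_time_s;
}

/* Footprint of the three arrays in MB (1 MB = 2^20 bytes here). */
double stream_footprint_mb(long long n)
{
    assert(n > 0);
    return 3.0 * 8.0 * (double)n / (1024.0 * 1024.0);
}
```

With n = 256002800, stream_footprint_mb gives about 5859.4, in line with
the 5859 MB printed above, and stream_rate_mb(n, COPY_ARRAYS, 0.0498)
gives about 82250 MB/s, close to the reported 82276 MB/s (the printed
minimum time is rounded to four decimals). The same arithmetic suggests
why the diff widens bytes to INTEGER*8: 24 bytes times 256002800
elements is about 6.1e9, which does not fit in a 32-bit signed integer.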