[stream] Stream results for HP 9000 Superdome

From: Kirby L. Collins <kcollins@rsn.hp.com>
Date: Tue Feb 10 2004 - 16:22:08 CST

Stream results for HP 9000 Superdome with 1000MHz dual-processor PA-8800
modules, each with 32MB of external cache shared by its processor pair:

Rates in MB/s:

System                                           Copy   Scale     Add   Triad
HP 9000 Superdome PA-8800  8 cells, 64 cpus     14662   14927   15727   15839
HP 9000 Superdome PA-8800 16 cells, 64 cpus     29028   27113   30289   30560

The 16 cell configuration with 2 PA-8800 modules per cell
(half populated) is orderable and fully supported by HP.

The system was running HP-UX 11i TCOE (December 2003), with
all memory interleaved across the cells.

The f90 version of the stream benchmark was compiled auto-parallel, with
the following changes (mysecond.c is a C routine that calls gettimeofday):

63c63
<       PARAMETER (n=2000000,offset=0,ndim=n+offset,ntimes=10)
---
>       PARAMETER (n=576002248,offset=0,ndim=n+offset,ntimes=10)
72c72
<       INTEGER bytes(4)
---
>       INTEGER*8 bytes(4)
90c90
< *     COMMON a,b,c
---
>       COMMON a,b,c
200c200
<  9020 FORMAT (1x,a,i4,a)
---
>  9020 FORMAT (1x,a,i8,a)
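
The diff does not say why bytes(4) and the i4 format field were widened. A plausible reading is that, with n=576002248, the per-iteration byte counts no longer fit in a signed 32-bit integer and the memory figure no longer fits in four digits. The following sketch checks that arithmetic. The 2-array/3-array counts per kernel are the usual STREAM convention and are an assumption here, as are all the Python names.

```python
# Array length from the modified PARAMETER statement in the diff.
N = 576002248
BYTES_PER_WORD = 8            # "Assuming 8 bytes per DOUBLE PRECISION word"
INT32_MAX = 2**31 - 1         # largest signed 32-bit integer

# Assumed STREAM convention: Copy/Scale touch 2 arrays, Add/Triad touch 3.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# Bytes moved per iteration of each kernel.
bytes_moved = {k: a * BYTES_PER_WORD * N for k, a in ARRAYS_TOUCHED.items()}

# Which of those counts exceed the signed 32-bit range?
overflows = {k: b > INT32_MAX for k, b in bytes_moved.items()}

# Total memory for the three arrays in MB (1024*1024 bytes), truncated.
total_mb = 3 * BYTES_PER_WORD * N // (1024 * 1024)

for kernel, count in bytes_moved.items():
    print(kernel, count, "exceeds int32" if overflows[kernel] else "fits in int32")
print("total memory (MB):", total_mb, "digits:", len(str(total_mb)))
```

The memory figure computed this way matches the "13183 MB" line in the output below, which needs five digits and so would not fit an i4 field.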

Build commands:

	rm -f *.o stream_d.mp stream_d.uni stream_c.mp stream_c.uni
	cc +DD64 +O3 -c mysecond.c
	f90 -o stream_d.mp +Ofaster -Wl,+pd,1M +DD64 +Oautopar +Onoopenmp +autodbl4 +extend_source +noppu stream_d.f mysecond.o

Output for 8 cells, 64 processors, 128GB (256x512MB DIMMs):
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size =  576002248
 Offset     =          0
 The total memory requirement is    13183 MB
 You are running each test  10 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------------
 Your clock granularity/precision appears to be      2 microseconds
 ----------------------------------------------------
Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:      14662.0459      0.6509      0.6286      0.6634  
Scale:     14927.3408      0.6364      0.6174      0.6789  
Add:       15727.4002      0.9066      0.8790      0.9297  
Triad:     15839.2678      0.8995      0.8728      0.9112  
 ----------------------------------------------------
 Solution Validates!
 ----------------------------------------------------
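
As a cross-check on the 8-cell output above, the reported rates can be approximately reproduced from the array size and the printed minimum times. This is a minimal sketch, assuming the usual STREAM byte counting (2 arrays for Copy and Scale, 3 for Add and Triad) and MB meaning 10^6 bytes in the rate column. The minimum times are printed to only four decimals, so agreement is expected only to about one part in 10^4.

```python
N = 576002248                 # array size from the output
BYTES_PER_WORD = 8            # bytes per DOUBLE PRECISION word, from the output

# Assumed STREAM convention for arrays read or written per kernel.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# "Min time" and "Rate (MB/s)" columns from the 8-cell run.
MIN_TIME = {"Copy": 0.6286, "Scale": 0.6174, "Add": 0.8790, "Triad": 0.8728}
REPORTED = {"Copy": 14662.0459, "Scale": 14927.3408,
            "Add": 15727.4002, "Triad": 15839.2678}


def rate_mb_per_s(kernel: str) -> float:
    """Bytes moved per iteration divided by the best time, in 10^6 bytes/s."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_WORD * N
    return bytes_moved / (1.0e6 * MIN_TIME[kernel])


# Relative difference between the recomputed and the reported rate.
rel_error = {k: abs(rate_mb_per_s(k) - REPORTED[k]) / REPORTED[k]
             for k in REPORTED}

for kernel in REPORTED:
    print(f"{kernel:6s} recomputed {rate_mb_per_s(kernel):11.2f}  "
          f"reported {REPORTED[kernel]:11.2f}  rel.err {rel_error[kernel]:.1e}")
```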

Output for 16 cells, 64 processors, 256GB (512x512MB DIMMs):
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size =  576002248
 Offset     =          0
 The total memory requirement is    13183 MB
 You are running each test  10 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------------
 Your clock granularity/precision appears to be      1 microseconds
 ----------------------------------------------------
Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:      29027.5284      0.3187      0.3175      0.3198  
Scale:     27112.9345      0.3414      0.3399      0.3435  
Add:       30289.4702      0.4571      0.4564      0.4576  
Triad:     30559.9206      0.5002      0.4524      0.8271  
 ----------------------------------------------------
 Solution Validates!
 ----------------------------------------------------
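
For a quick comparison of the two configurations, the ratio of the 16-cell rate to the 8-cell rate can be computed per kernel from the summary table. This is only an illustrative sketch; the variable names are made up, and the numbers are the rounded MB/s values from the table at the top of the post.

```python
# Rounded rates in MB/s from the summary table (64 cpus in both cases).
RATE_8_CELLS = {"Copy": 14662, "Scale": 14927, "Add": 15727, "Triad": 15839}
RATE_16_CELLS = {"Copy": 29028, "Scale": 27113, "Add": 30289, "Triad": 30560}

# Bandwidth ratio when the same 64 processors are spread over twice as many cells.
speedup = {k: RATE_16_CELLS[k] / RATE_8_CELLS[k] for k in RATE_8_CELLS}

for kernel, ratio in speedup.items():
    print(f"{kernel:6s} 16 cells / 8 cells = {ratio:.2f}x")
```

With these figures, every kernel lands between roughly 1.8x and 2.0x, with Scale showing the smallest gain.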