[stream] STREAM results for HP Integrity Superdome

From: Kirby Collins <kirby.collins@hp.com>
Date: Mon Sep 22 2003 - 19:35:03 CDT

Stream results for HP Integrity Superdome with 1500MHz,
6MB L3 cache Itanium2 processors:

16 cells, 64 CPUs, 256 GB of memory (512x512MB DIMMs):
Function   Rate (MB/s)   Avg time   Min time   Max time
Copy:       82695.4003     0.0995     0.0991     0.0997
Scale:      82475.6402     0.0994     0.0994     0.0995
Add:        83012.8911     0.1488     0.1481     0.1527
Triad:      84222.7077     0.1472     0.1459     0.1516

8 cells, 32 CPUs, 128 GB of memory (256x512MB DIMMs):
Function   Rate (MB/s)   Avg time   Min time   Max time
Copy:       41336.0147     0.0992     0.0991     0.0993
Scale:      41293.9520     0.0996     0.0992     0.1003
Add:        41484.8709     0.1485     0.1482     0.1492
Triad:      42187.6441     0.1460     0.1457     0.1463

4 cells, 16 CPUs, 64 GB of memory (128x512MB DIMMs):
Function   Rate (MB/s)   Avg time   Min time   Max time
Copy:       20701.4176     0.0996     0.0990     0.1012
Scale:      20699.3481     0.0991     0.0990     0.0993
Add:        20792.5427     0.1479     0.1478     0.1480
Triad:      21112.6698     0.1459     0.1456     0.1465
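
The three tables can be compared per processor. The following is a small sketch, not part of the original runs, and the helper names rate_per_cpu and scaling_efficiency are made up for illustration. With the Triad figures above, the 64-CPU run comes out at roughly 1316 MB/s per CPU, about 99.7% of linear scaling from the 16-CPU run.

```c
#include <assert.h>

/* Aggregate rate divided by the number of processors, in MB/s per CPU. */
double rate_per_cpu(double rate_mbs, int ncpus)
{
    assert(ncpus > 0);
    return rate_mbs / (double)ncpus;
}

/*
 * Per-CPU rate of a run relative to the per-CPU rate of a smaller
 * baseline run.  1.0 would mean perfectly linear scaling.
 */
double scaling_efficiency(double rate_mbs, int ncpus,
                          double base_rate_mbs, int base_ncpus)
{
    assert(base_rate_mbs > 0.0);
    return rate_per_cpu(rate_mbs, ncpus) /
           rate_per_cpu(base_rate_mbs, base_ncpus);
}
```

For example, scaling_efficiency(84222.7077, 64, 21112.6698, 16) compares the 16-cell Triad rate against the 4-cell one.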

The system was booted with half of the memory in each
cell configured as local memory.

The runs used the MPI version of the STREAM benchmark, with
the following changes (the second() timer function was replaced
with a C routine that calls gettimeofday):

74c74
<       PARAMETER (n=2000000,offset=0,ndim=n+offset,ntimes=10)
---
>       PARAMETER (n=8002800,offset=0,ndim=n+offset,ntimes=10)
544,547c544,547
<       real*8 function second(dummy)
<       real*8 dummy, rtc
<       second = rtc()
<       end
---
> CKLC      real*8 function second(dummy)
> CKLC      real*8 dummy, rtc
> CKLC      second = rtc()
> CKLC      end
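
The replacement timer itself is not included in this message. Here is a minimal sketch of what second_wall.c might look like. It is a guess, not the source that was actually used. It assumes the routine keeps the name second, that Fortran passes the dummy argument by reference, and that the linker expects the plain symbol name with no trailing underscore.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/*
 * Hypothetical stand-in for second_wall.c (the original file is not shown).
 * Returns wall-clock time in seconds as a double, so it can take the place
 * of the Fortran "real*8 function second(dummy)".
 * Assumptions: dummy arrives by reference, and the external symbol name
 * is plain "second".
 */
double second(double *dummy)
{
    struct timeval tv;
    int rc;

    (void)dummy;                     /* unused, as in the Fortran original */
    rc = gettimeofday(&tv, NULL);    /* wall clock, microsecond resolution */
    assert(rc == 0);
    (void)rc;                        /* avoid an unused warning under NDEBUG */

    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
```

A timer of this kind would be consistent with the 1 microsecond clock granularity that the benchmark reports.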

The benchmark was compiled as follows:

cc +DSitanium2 +DD64 +O3 +Odataprefetch -Wl,+pd,64M -c second_wall.c
mpif90 -o stream_d.mpi +Ofaster +DSitanium2 -Wl,+pd,64M +DD64 +Onoopenmp \
    +extend_source +noppu stream_mpi.f second_wall.o

and run with the mpsched -P RR command to distribute the processes in a
round-robin fashion across the locality domains.  By default the MPI tasks
allocated memory from the local memory in each cell.  Here are the outputs
for each configuration:

16 cells, 64 processors, 256 GB of memory (512x512MB DIMMs):
mpsched -P RR mpirun -np 64 stream_d.mpi
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Number of processors =  64
 Array size =    8002800
 Offset     =          0
 The total memory requirement is   11722.9 MB (    183.2MB/task)
 You are running each test  10 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------------
 Your clock granularity/precision appears to be      1 microseconds
 ----------------------------------------------------
Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:      82695.4003      0.0995      0.0991      0.0997  
Scale:     82475.6402      0.0994      0.0994      0.0995  
Add:       83012.8911      0.1488      0.1481      0.1527  
Triad:     84222.7077      0.1472      0.1459      0.1516  
 -----------------------------------------------
 Solution Validates!
 -----------------------------------------------
8 cells, 32 processors, 128 GB of memory (256x512MB DIMMs):
mpsched -P RR mpirun -np 32 stream_d.mpi
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Number of processors =  32
 Array size =    8002800
 Offset     =          0
 The total memory requirement is    5861.4 MB (    183.2MB/task)
 You are running each test  10 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------------
 Your clock granularity/precision appears to be      1 microseconds
 ----------------------------------------------------
Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:      41336.0147      0.0992      0.0991      0.0993
Scale:     41293.9520      0.0996      0.0992      0.1003
Add:       41484.8709      0.1485      0.1482      0.1492
Triad:     42187.6441      0.1460      0.1457      0.1463
 -----------------------------------------------
 Solution Validates!
 -----------------------------------------------
4 cells, 16 processors, 64 GB of memory (128x512MB DIMMs):
mpsched -P RR mpirun -np 16 stream_d.mpi
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Number of processors =  16
 Array size =    8002800
 Offset     =          0
 The total memory requirement is    2930.7 MB (    183.2MB/task)
 You are running each test  10 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------------
 Your clock granularity/precision appears to be      1 microseconds
 ----------------------------------------------------
Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:      20701.4176      0.0996      0.0990      0.1012
Scale:     20699.3481      0.0991      0.0990      0.0993
Add:       20792.5427      0.1479      0.1478      0.1480
Triad:     21112.6698      0.1459      0.1456      0.1465
 -----------------------------------------------
 Solution Validates!
 -----------------------------------------------
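
The memory and rate figures in these outputs can be cross-checked from the array size and the minimum times. This sketch assumes the usual STREAM byte counting: three arrays of 8-byte words per task, two arrays touched by Copy and Scale, three by Add and Triad, summed over all tasks. The function names are made up for illustration. Under those assumptions it reproduces the reported 183.2 MB/task and comes within about 0.1% of the reported rates (the printed minimum times are rounded to four decimals). The memory line appears to use units of 2^20 bytes, while the rates use 10^6 bytes.

```c
#include <assert.h>

/* Bytes per DOUBLE PRECISION word, as stated in the benchmark output. */
#define WORD_BYTES 8.0

/* Memory for the three arrays of one task, in units of 2^20 bytes. */
double stream_mem_per_task_mb(long n)
{
    assert(n > 0);
    return 3.0 * WORD_BYTES * (double)n / (1024.0 * 1024.0);
}

/*
 * Aggregate rate in MB/s (10^6 bytes per second) over all tasks.
 * arrays_touched is 2 for Copy and Scale, 3 for Add and Triad.
 */
double stream_rate_mbs(int arrays_touched, long n, int ntasks, double min_time)
{
    double bytes;

    assert(min_time > 0.0);
    bytes = (double)arrays_touched * WORD_BYTES * (double)n * (double)ntasks;
    return 1.0e-6 * bytes / min_time;
}
```

For example, stream_rate_mbs(2, 8002800L, 64, 0.0991) corresponds to the 64-processor Copy line.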