HP Integrity Superdome with sx2000 stream results

From: Kirby L. Collins <kcollins@rsn.hp.com>
Date: Tue Mar 21 2006 - 12:42:30 CST

Stream results for HP Integrity Superdome with the sx2000 chipset,
and 1.6GHz/9MB Itanium 2 processors:

16 cells, 64 processors, 512GB of memory (512x1GB DIMMs):
---------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:         116140.9092     0.0399     0.0397     0.0404
Scale:        114618.0682     0.0406     0.0402     0.0427
Add:          127867.5927     0.0542     0.0541     0.0548
Triad:        128686.0452     0.0539     0.0537     0.0541

8 cells, 32 processors, 256GB of memory (256x1GB DIMMs):
--------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          58128.9633     0.0794     0.0793     0.0798
Scale:         57734.9657     0.0802     0.0798     0.0818
Add:           64496.3297     0.1075     0.1072     0.1089
Triad:         64715.4557     0.1736     0.1068     0.7050

4 cells, 16 processors, 128GB of memory (128x1GB DIMMs):
--------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          29035.9252     0.1767     0.1587     0.3159
Scale:         28979.2846     0.1688     0.1590     0.2464
Add:           32356.9694     0.2140     0.2136     0.2152
Triad:         32406.5934     0.2137     0.2133     0.2148
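
As a rough way to read the three tables together, the sketch below
computes the Triad speedup relative to the 4-cell run, the resulting
scaling efficiency, and the bandwidth per processor. The rates are copied
from the tables; the function and variable names are illustrative only,
and 4 processors per cell is inferred from the configurations listed
(16, 32 and 64 processors on 4, 8 and 16 cells).

```python
# Triad rates in MB/s, copied from the result tables (keyed by cell count).
triad_mb_s = {4: 32406.5934, 8: 64715.4557, 16: 128686.0452}

# Inferred from the listed configurations: 16/4 = 32/8 = 64/16 = 4.
PROCS_PER_CELL = 4


def scaling(rates):
    """Return (cells, speedup, efficiency, MB/s per processor) rows.

    Speedup is measured against the smallest configuration; efficiency is
    the speedup divided by the ratio of cell counts (1.0 = linear scaling).
    """
    base_cells = min(rates)
    base_rate = rates[base_cells]
    rows = []
    for cells in sorted(rates):
        speedup = rates[cells] / base_rate
        efficiency = speedup / (cells / base_cells)
        per_proc = rates[cells] / (cells * PROCS_PER_CELL)
        rows.append((cells, speedup, efficiency, per_proc))
    return rows


for cells, speedup, eff, per_proc in scaling(triad_mb_s):
    print(f"{cells:2d} cells: speedup {speedup:.3f}x, "
          f"efficiency {eff:.1%}, {per_proc:.0f} MB/s per processor")
```

With these inputs the efficiency stays above 99% from 4 to 16 cells, so
the Triad results scale close to linearly with the number of cells.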

The system was configured with half of the memory in each
cell assigned to local memory.

The runs used the v5.6 f90 version of the stream benchmark, with
the following changes:

97c97
< PARAMETER (n=2000000,offset=0,ndim=n+offset,ntimes=10)
---
> PARAMETER (n=288000064,offset=0,ndim=n+offset,ntimes=10)
124c124
< * COMMON a,b,c
---
> COMMON a,b,c
245c245
< 9020 FORMAT (1x,a,i4,a)
---
> 9020 FORMAT (1x,a,i6,a)
247c247
< 9040 FORMAT ('Function',5x,'Rate (MB/s) Avg time Min time Max time'
---
> 9040 FORMAT ('Function',5x,'Rate (MB/s) Avg time Min time Max time'
249c249
< 9050 FORMAT (a,4 (f10.4,2x))
---
> 9050 FORMAT (a,f12.4,2x,3 (f10.4,2x))

compiled as follows:

f90 -o stream_d.omp +Ofaster +DD64 +extend_source +autodbl4 +noppu
-Wl,+pd,1M +Oopenmp stream.f mysecond.o

and run with the mpsched -T RR command to distribute the threads in a
round-robin fashion across the locality domains. By default, each thread
allocated memory from the local memory of its own cell. Here are the
outputs for each configuration:

16 cells, 64 processors, 512 GB of memory (512x1GB DIMMs):

mpsched -T RR stream_d.omp
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 ----------------------------------------------
 STREAM Version $Revision: 5.6 $
 ----------------------------------------------
 Array size = 288000064
 Offset = 0
 The total memory requirement is 6591 MB
 You are running each test 10 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------
 Number of Threads = 64
 ----------------------------------------------
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 ----------------------------------------------------
 Your clock granularity/precision appears to be 1 microseconds
 ----------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:         116140.9092     0.0399     0.0397     0.0404
Scale:        114618.0682     0.0406     0.0402     0.0427
Add:          127867.5927     0.0542     0.0541     0.0548
Triad:        128686.0452     0.0539     0.0537     0.0541
 ----------------------------------------------------
 Solution Validates!
 ----------------------------------------------------
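
The figures in the output above can be cross-checked by hand. The sketch
below assumes the usual STREAM accounting: three arrays of 8-byte words
for the memory requirement, two arrays moved per iteration for Copy and
Scale, three for Add and Triad, and MB/s taken as bytes / 1e6 / best
time. The function names are illustrative and are not taken from
stream.f.

```python
N = 288000064        # "Array size" reported in the output
BYTES_PER_WORD = 8   # "Assuming 8 bytes per DOUBLE PRECISION word"

# Arrays read or written per iteration by each kernel (assumed accounting).
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def total_memory_mb(n=N):
    """Memory for the three arrays a, b, c in units of 2**20 bytes,
    truncated to an integer."""
    return 3 * BYTES_PER_WORD * n // 2**20


def rate_mb_s(kernel, min_time_s, n=N):
    """Bandwidth in MB/s (10**6 bytes per second) from the best time."""
    return ARRAYS_TOUCHED[kernel] * BYTES_PER_WORD * n / min_time_s / 1e6


print(total_memory_mb())            # compare with "6591 MB" in the output
print(rate_mb_s("Copy", 0.0397))    # compare with the reported Copy rate
print(rate_mb_s("Triad", 0.0537))   # compare with the reported Triad rate
```

The recomputed rates differ slightly from the reported ones because the
minimum times are printed to only four decimal places.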

8 cells, 32 processors, 256 GB of memory (256x1GB DIMMs):

mpsched -T RR stream_d.omp
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 ----------------------------------------------
 STREAM Version $Revision: 5.6 $
 ----------------------------------------------
 Array size = 288000064
 Offset = 0
 The total memory requirement is 6591 MB
 You are running each test 10 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------
 Number of Threads = 32
 ----------------------------------------------
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 ----------------------------------------------------
 Your clock granularity/precision appears to be 1 microseconds
 ----------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          58128.9633     0.0794     0.0793     0.0798
Scale:         57734.9657     0.0802     0.0798     0.0818
Add:           64496.3297     0.1075     0.1072     0.1089
Triad:         64715.4557     0.1736     0.1068     0.7050
 ----------------------------------------------------
 Solution Validates!
 ----------------------------------------------------
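
The Triad line in this run has an average time (0.1736) well above its
minimum (0.1068), with a maximum of 0.7050. A quick check suggests this
is consistent with a single slow iteration. The sketch below assumes 9
timed iterations (ntimes=10 with the first one discarded) and that all
other iterations ran at about the minimum time; both are assumptions, not
something the output states directly.

```python
def mean_with_one_outlier(typical_s, outlier_s, timed_iters):
    """Average time if one iteration took outlier_s and every other
    timed iteration took typical_s."""
    return ((timed_iters - 1) * typical_s + outlier_s) / timed_iters


# 8-cell Triad: min 0.1068 s, max 0.7050 s, 9 timed iterations (assumed).
estimate = mean_with_one_outlier(0.1068, 0.7050, 9)
print(f"estimated average: {estimate:.4f} s (reported: 0.1736 s)")
```

Since the rate is computed from the best time, one slow iteration of this
kind does not change the reported Triad bandwidth.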


4 cells, 16 processors, 128 GB of memory (128x1GB DIMMs):

mpsched -T RR stream_d.omp
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 ----------------------------------------------
 STREAM Version $Revision: 5.6 $
 ----------------------------------------------
 Array size = 288000064
 Offset = 0
 The total memory requirement is 6591 MB
 You are running each test 10 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------
 Number of Threads = 16
 ----------------------------------------------
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 ----------------------------------------------------
 Your clock granularity/precision appears to be 1 microseconds
 ----------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          29035.9252     0.1767     0.1587     0.3159
Scale:         28979.2846     0.1688     0.1590     0.2464
Add:           32356.9694     0.2140     0.2136     0.2152
Triad:         32406.5934     0.2137     0.2133     0.2148
 ----------------------------------------------------
 Solution Validates!
 ----------------------------------------------------




Received on Tue Mar 21 21:43:45 2006
