Stream results for an rx8640 with Dual-Core Itanium 2 processors

From: Kirby L. Collins <kcollins@rsn.hp.com>
Date: Wed Sep 20 2006 - 21:32:55 CST

Stream results for the HP Integrity rx8640 with the sx2000 chipset, and
1.6GHz/24MB Dual-Core Intel(R) Itanium(R) 2 processors, running HP-UX:

2 cells, 8 processors/16 cores
64GB of memory (32x2GB DIMMs)
HP-UX 11.23.0609, HP f90 11.23.32
-----------------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          19296.9724     0.2123     0.2123     0.2124
Scale:         19174.4903     0.2137     0.2136     0.2139
Add:           21318.3320     0.2886     0.2882     0.2898
Triad:         21391.3104     0.2874     0.2872     0.2875

4 cells, 16 processors/32 cores
128GB of memory (64x2GB DIMMs)
HP-UX 11.23.0609, HP f90 11.23.32
-----------------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          38660.1229     0.1060     0.1059     0.1060
Scale:         38354.8453     0.1069     0.1068     0.1071
Add:           42440.6845     0.1449     0.1448     0.1450
Triad:         42785.6041     0.1443     0.1436     0.1449
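
As a quick check on how the two configurations compare, the sketch below computes the 2-cell to 4-cell scaling factor and the bandwidth per core from the rates in the two tables above. The variable and function names are illustrative only; the core counts (16 and 32) come from the configuration lines.

```python
# Rates in MB/s, copied from the two summary tables above.
rates_2_cells = {"Copy": 19296.9724, "Scale": 19174.4903,
                 "Add": 21318.3320, "Triad": 21391.3104}
rates_4_cells = {"Copy": 38660.1229, "Scale": 38354.8453,
                 "Add": 42440.6845, "Triad": 42785.6041}

CORES_2_CELLS = 16  # 8 processors / 16 cores
CORES_4_CELLS = 32  # 16 processors / 32 cores


def scaling(small, large):
    """Ratio of the larger configuration's rate to the smaller one's, per kernel."""
    return {kernel: large[kernel] / small[kernel] for kernel in small}


speedup = scaling(rates_2_cells, rates_4_cells)
per_core_2 = {k: v / CORES_2_CELLS for k, v in rates_2_cells.items()}
per_core_4 = {k: v / CORES_4_CELLS for k, v in rates_4_cells.items()}

for kernel in rates_2_cells:
    print(f"{kernel:6s} scaling {speedup[kernel]:.4f}  "
          f"MB/s per core: {per_core_2[kernel]:8.1f} (2 cells) "
          f"{per_core_4[kernel]:8.1f} (4 cells)")
```

All four kernels scale by a factor very close to 2.0 when the cell count doubles, so the bandwidth per core is essentially the same in both configurations.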

Note that with these processors the front side busses run at 533MT/sec.
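
For scale, here is a rough sketch of the peak transfer rate of a single front side bus at 533MT/sec. The bus width is an assumption that is not stated in this post: the Itanium 2 system bus is taken to be 128 bits (16 bytes) wide. The number of busses per cell is also not stated here, so no system-wide peak is derived.

```python
# Transfer rate from the note above.
TRANSFERS_PER_SEC = 533e6

# ASSUMPTION (not from this post): 128-bit (16-byte) wide data bus.
BUS_WIDTH_BYTES = 16

# Peak rate of one bus, in MB/s (1 MB = 1e6 bytes, the unit STREAM reports).
peak_mb_per_sec = TRANSFERS_PER_SEC * BUS_WIDTH_BYTES / 1e6

print(f"Peak per front side bus: {peak_mb_per_sec:.0f} MB/s")
```

Under that assumption one bus peaks at about 8.5 GB/s.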

The system was configured with half of the memory in each cell assigned to
local memory.

The runs used the v5.6 f90 version of the stream benchmark, with the
following changes:

97c97
<       PARAMETER (n=2000000,offset=0,ndim=n+offset,ntimes=10)
---
>       PARAMETER (n=256002208,offset=0,ndim=n+offset,ntimes=10)
101c101
<       INTEGER j,k,nbpw,quantum
---
>       INTEGER*8 j,k,nbpw,quantum
106c106
<       INTEGER bytes(4)
---
>       INTEGER*8 bytes(4)
124c124
< *     COMMON a,b,c
---
>       COMMON a,b,c
245c245
<  9020 FORMAT (1x,a,i4,a)
---
>  9020 FORMAT (1x,a,i6,a)
247c247
<  9040 FORMAT ('Function',5x,'Rate (MB/s)  Avg time   Min time  Max time'
---
>  9040 FORMAT ('Function',5x,'Rate (MB/s)    Avg time   Min time  Max time'
249c249
<  9050 FORMAT (a,4 (f10.4,2x))
---
>  9050 FORMAT (a,f12.4,2x,3 (f10.4,2x))
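
A likely reason for the INTEGER*8 changes is that byte counts for the enlarged arrays no longer fit in a 32-bit integer. The sketch below checks this, assuming the per-kernel traffic is counted the usual STREAM way (2 words per element for Copy and Scale, 3 for Add and Triad) and 8 bytes per word; the names used are illustrative, not taken from stream.f.

```python
N = 256002208          # array size from the PARAMETER change above
BYTES_PER_WORD = 8     # DOUBLE PRECISION
INT32_MAX = 2**31 - 1  # largest value a default 32-bit INTEGER can hold

# ASSUMPTION: words moved per array element for each kernel (usual STREAM counting).
words_per_element = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

bytes_per_kernel = {k: w * BYTES_PER_WORD * N for k, w in words_per_element.items()}
overflows_int32 = {k: b > INT32_MAX for k, b in bytes_per_kernel.items()}

# Total footprint of the three arrays a, b, c in MB (1 MB = 1024*1024 bytes).
total_mb = 3 * BYTES_PER_WORD * N // (1024 * 1024)

for kernel, nbytes in bytes_per_kernel.items():
    print(f"{kernel:6s} {nbytes:>13d} bytes  exceeds 32-bit: {overflows_int32[kernel]}")
print(f"Total memory requirement: {total_mb} MB")
```

The array size itself still fits in 32 bits, but every per-kernel byte count exceeds 2**31 - 1, and the three arrays together come to 5859 MB, which matches the figure in the program output.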

The modified source was compiled as follows:

	f90 -o stream_d.omp +Ofaster +DSitanium2 +DD64 +extend_source \
	    +autodbl4 +noppu -Wl,+pd,1M +Oopenmp stream.f mysecond.o

It was run with the "mpsched -T FILL" command to distribute threads across
locality domains.  By default, each thread allocated memory from the local
memory of its own cell.  Here are the outputs for each configuration:

2 cells, 8 processors, 64GB of Memory (32x2GB DIMMs):
-------------------------------------------------------
mpsched -T FILL stream_d.omp
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 ----------------------------------------------
 STREAM Version $Revision: 5.6 $
 ----------------------------------------------
 Array size =  256002208
 Offset     =          0
 The total memory requirement is   5859 MB
 You are running each test  10 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------
 Number of Threads =  16
 ----------------------------------------------
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 ----------------------------------------------------
 Your clock granularity/precision appears to be      1 microseconds
 ----------------------------------------------------
Function     Rate (MB/s)    Avg time   Min time  Max time
Copy:        19296.9724      0.2123      0.2123      0.2124
Scale:       19174.4903      0.2137      0.2136      0.2139
Add:         21318.3320      0.2886      0.2882      0.2898
Triad:       21391.3104      0.2874      0.2872      0.2875
 ----------------------------------------------------
 Solution Validates!
 ----------------------------------------------------
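
The reported rates can be cross-checked against the array size and the minimum times in the output above. The sketch assumes the rate is bytes moved divided by the minimum time, in units of 1e6 bytes per second, with 2 words per element for Copy and Scale and 3 for Add and Triad. Because the printed times are rounded to four decimals, only approximate agreement is expected.

```python
N = 256002208       # "Array size" from the output above
BYTES_PER_WORD = 8  # "Assuming 8 bytes per DOUBLE PRECISION word"

# kernel: (words per element [ASSUMED counting], reported MB/s, reported min time in s)
results = {
    "Copy":  (2, 19296.9724, 0.2123),
    "Scale": (2, 19174.4903, 0.2136),
    "Add":   (3, 21318.3320, 0.2882),
    "Triad": (3, 21391.3104, 0.2872),
}


def rate_mb_per_sec(words, min_time):
    """Bytes moved by one pass of the kernel, divided by its best time, in MB/s."""
    return words * BYTES_PER_WORD * N / min_time / 1e6


relative_error = {}
for kernel, (words, reported, min_time) in results.items():
    recomputed = rate_mb_per_sec(words, min_time)
    relative_error[kernel] = abs(recomputed - reported) / reported
    print(f"{kernel:6s} reported {reported:11.4f}  recomputed {recomputed:11.4f}")
```

The recomputed values agree with the reported ones to within about 0.02%, which is consistent with the rounding of the printed times.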
4 cells, 16 processors, 128 GB of memory (64x2GB DIMMs):
----------------------------------------------------------
mpsched -T FILL stream_d.omp
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 ----------------------------------------------
 STREAM Version $Revision: 5.6 $
 ----------------------------------------------
 Array size =  256002208
 Offset     =          0
 The total memory requirement is   5859 MB
 You are running each test  10 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------
 Number of Threads =  32
 ----------------------------------------------
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 Printing one line per active thread....
 ----------------------------------------------------
 Your clock granularity/precision appears to be      1 microseconds
 ----------------------------------------------------
Function     Rate (MB/s)    Avg time   Min time  Max time
Copy:        38660.1229      0.1060      0.1059      0.1060
Scale:       38354.8453      0.1069      0.1068      0.1071
Add:         42440.6845      0.1449      0.1448      0.1450
Triad:       42785.6041      0.1443      0.1436      0.1449
 ----------------------------------------------------
 Solution Validates!
 ----------------------------------------------------
Received on Thu Sep 21 07:34:37 2006
