Stream results for an rx7640 with Dual-Core Itanium 2

From: Kirby L. Collins <kcollins@rsn.hp.com>
Date: Wed Sep 20 2006 - 21:32:02 CST

Stream results for the HP Integrity rx7640 with the sx2000 chipset, and
1.6GHz/18MB Dual-Core Intel(R) Itanium(R) 2 processors, running HP-UX 11.23:

1 cell, 4 processors/8 cores
32GB of memory (16x2GB DIMMs)
HPUX 11.23.0609, HP C 11.23.12
------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        9682.9470       0.1191       0.1190       0.1195
Scale:       9620.5242       0.1200       0.1197       0.1203
Add:        10619.6350       0.1630       0.1627       0.1635
Triad:      10693.4884       0.1619       0.1616       0.1623

2 cells, 8 processors/16 cores
64GB of memory (32x2GB DIMMs)
HPUX 11.23.0609, HP C 11.23.12
--------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       19441.2328       0.0593       0.0593       0.0594
Scale:      18986.8998       0.0611       0.0607       0.0623
Add:        21156.1625       0.1027       0.0817       0.2704
Triad:      21432.2426       0.1022       0.0806       0.2736

Note that with these processors the front-side buses run at 533 MT/s.

This system was configured with half of the memory in each cell assigned to
local memory.

The runs used the v5.6 C version of the STREAM benchmark, with the
following change:

57c57
< # define N 2000000
---
> # define N    72000640

The benchmark was compiled as follows:

cc -o stream_c.omp +Ofaster +DSitanium2 +DD64 +O4 +Odataprefetch +FPD \
    -Wl,+pd,1M +Oopenmp stream.c -lm

and run with the "mpsched -T FILL" command, which launches the threads
using the FILL (fill-first) policy across the locality domains.  By
default, memory pages are allocated from memory local to the faulting
thread.  Here is the output for each configuration:

1 cell, 4 processors, 32GB of Memory (16x2GB DIMMs):
------------------------------------------------------
mpsched -T FILL stream_c.omp
-------------------------------------------------------------
STREAM version $Revision: 5.6 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 72000640, Offset = 0
Total memory required = 1648.0 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 8
-------------------------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 84783 microseconds.
   (= 84783 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        9682.9470       0.1191       0.1190       0.1195
Scale:       9620.5242       0.1200       0.1197       0.1203
Add:        10619.6350       0.1630       0.1627       0.1635
Triad:      10693.4884       0.1619       0.1616       0.1623
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
2 cells, 8 processors, 64GB of Memory (32x2GB DIMMs):
------------------------------------------------------
mpsched -T FILL stream_c.omp
-------------------------------------------------------------
STREAM version $Revision: 5.6 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 72000640, Offset = 0
Total memory required = 1648.0 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 16
-------------------------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 43979 microseconds.
   (= 43979 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       19441.2328       0.0593       0.0593       0.0594
Scale:      18986.8998       0.0611       0.0607       0.0623
Add:        21156.1625       0.1027       0.0817       0.2704
Triad:      21432.2426       0.1022       0.0806       0.2736
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
Received on Thu Sep 21 07:34:36 2006