STREAM results of a Quad FX-72 (4 processor)

From: Dan O'Connell <oconnell@lettis.com>
Date: Thu Jun 07 2007 - 11:55:47 CDT

Hi,

I did some benchmarking on dual-core Opterons (Socket 1207) with 667 MHz DDR2. Going from 2 to 4 processors, I only got about a 20% increase over the previous dual single-core Opterons with 400 MHz DDR (about 11 GB/s with the new dual cores versus about 9 GB/s with two older (254) single cores). I don't have the output, since I did the benchmarking at Reclamation and have since left.

Here's a decent floating-point box for now, although it clearly only has enough memory bandwidth for about two processors. AMD really needs to update to HT 2/3, although maybe the memory hardware/modules aren't ready yet...

Avadirect AMD Quad FX Workstation (two AMD FX-72 dual-core processors).
8 GB of 800 MHz DDR2 memory (four 2 GB modules).

OpenSuse 10.2

Linux 2.6.18.8-0.3-default #1 SMP Tue Apr 17 08:42:35 UTC 2007 x86_64 x86_64 x86_64 GNU/Linux

Pathscale 2.4

pathf95 -Ofast -CG:load_exe=2 -LNO:blocking=off -msse -msse2 -m3dnow -mp basic_stream.f -DUNDERSCORE second_wall.c -o ps2p4_openmp_stream -static

4 processor result

** OpenMP warning: requested pthread stack too large, using 8388608 bytes instead
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 16000000
 Offset = 0
 The total memory requirement is 366 MB
 You are running each test 20 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 Number of Threads = 4
 Number of Threads = 4
 Number of Threads = 4
 Number of Threads = 4
 ----------------------------------------------------
 Your clock granularity/precision appears to be 1 microseconds
 ----------------------------------------------------
Function     Rate (MB/s)   Avg time   Min time   Max time
Copy:        14810.6406     0.0332     0.0173     0.0523
Scale:       15003.0995     0.0300     0.0171     0.0504
Add:         15325.0115     0.0406     0.0251     0.0800
Triad:       15498.2846     0.0390     0.0248     0.0783
 ----------------------------------------------------
 Solution Validates!
 ----------------------------------------------------

real 0m3.186s
user 0m5.944s
sys 0m0.272s
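
For anyone wanting to sanity-check the table above, here is a minimal Python sketch of how the rates and the 366 MB figure appear to follow from the array size and the minimum times. It assumes the usual STREAM byte counting (2 words per element for Copy and Scale, 3 for Add and Triad, rate = 1e-6 * bytes / min time); the function and variable names are just for illustration and are not taken from basic_stream.f.

```python
# Assumption: standard STREAM accounting, not read out of basic_stream.f.
N = 16_000_000          # "Array size" from the output above
BYTES_PER_WORD = 8      # "Assuming 8 bytes per DOUBLE PRECISION word"

# Words read plus written per array element (assumed STREAM convention).
WORDS_PER_ELEMENT = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# (reported rate in MB/s, reported min time in s) from the 4-thread run.
REPORTED = {
    "Copy": (14810.6406, 0.0173),
    "Scale": (15003.0995, 0.0171),
    "Add": (15325.0115, 0.0251),
    "Triad": (15498.2846, 0.0248),
}


def rate_mb_per_s(kernel, min_time):
    """Bandwidth in MB/s (10**6 bytes/s) implied by a kernel's best time."""
    bytes_moved = WORDS_PER_ELEMENT[kernel] * BYTES_PER_WORD * N
    return 1.0e-6 * bytes_moved / min_time


def total_memory_mib():
    """Three arrays of N words each, in units of 2**20 bytes."""
    return 3 * BYTES_PER_WORD * N / (1024 * 1024)


print(f"Total memory: {total_memory_mib():.1f} MiB")
for kernel, (reported, min_time) in REPORTED.items():
    computed = rate_mb_per_s(kernel, min_time)
    print(f"{kernel:6s} reported {reported:10.1f}  recomputed {computed:10.1f}")
```

The recomputed rates differ slightly from the reported ones because the min times in the table are rounded to four decimal places.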

Single processor

pathf95 -Ofast -CG:load_exe=2 -LNO:blocking=off -msse -msse2 -m3dnow basic_stream.f -DUNDERSCORE second_wall.c -o ps2p4_scalar_stream -static

numactl --localalloc ./ps2p4_scalar_stream
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 16000000
 Offset = 0
 The total memory requirement is 366 MB
 You are running each test 20 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------------
 Your clock granularity/precision appears to be 1 microseconds
 ----------------------------------------------------
Function     Rate (MB/s)   Avg time   Min time   Max time
Copy:         8232.2594     0.0311     0.0311     0.0313
Scale:        8065.0605     0.0318     0.0317     0.0319
Add:          7576.9651     0.0508     0.0507     0.0510
Triad:        7587.7452     0.0507     0.0506     0.0512
 ----------------------------------------------------
 Solution Validates!
 ----------------------------------------------------
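
To put a number on the "enough memory bandwidth for about two processors" remark, a short Python sketch that divides the 4-thread rates by the single-processor rates from the two tables above. The rates are copied from the output; the dictionary and function names are made up for this example.

```python
# Rates in MB/s copied from the two STREAM runs above.
FOUR_THREADS = {
    "Copy": 14810.6406,
    "Scale": 15003.0995,
    "Add": 15325.0115,
    "Triad": 15498.2846,
}
ONE_THREAD = {
    "Copy": 8232.2594,
    "Scale": 8065.0605,
    "Add": 7576.9651,
    "Triad": 7587.7452,
}


def speedup(kernel):
    """Ratio of 4-thread bandwidth to single-processor bandwidth."""
    return FOUR_THREADS[kernel] / ONE_THREAD[kernel]


for kernel in FOUR_THREADS:
    s = speedup(kernel)
    # Efficiency is the speedup divided by the 4 cores in use.
    print(f"{kernel:6s} speedup {s:.2f}x  efficiency {s / 4:.0%}")
```

All four kernels come out at roughly 1.8x to 2.0x, so four cores deliver about what two would if bandwidth scaled linearly per core.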

Cheers, DRHO

Daniel R.H. O'Connell, Ph.D.
Senior Geophysicist
William Lettis and Associates, Inc.
433 Park Point Drive, Suite 250
Golden, CO 80401
oconnell@lettis.com
Received on Fri Jun 08 10:57:09 2007
