RE: AMD Opteron 6100 4-socket = 110 Gb/s

From: O'Connell, Dan [FWLA] <d.oconnell@fugro.com>
Date: Wed Aug 04 2010 - 01:12:20 CDT

Hi John,
This may not be the fastest compiler available, but it gives a decent idea of performance. I max out at about 105 GB/s (STREAM Triad) with openf95.
System is a Supermicro quad-socket:
Motherboard:
Supermicro MBD-H8QGi-F SM, AMD Quad Socket G34, 32x DIMMs, 2 x16 PCIe 2.0, 2 x8 PCIe 2.0, 1x UIO, IPMI 2.0, VGA
Processors:
Four OS6136WKT8EGO AMD Magny-Cours 6136 8-Core 2.4GHz CPU, 16MB Cache, 45nm, Socket G34, for a total of 32 cores
Memory:
16 KVR1333D3D4R9S/4G 4GB 1333MHz DDR3 ECC Reg w/Parity CL9 DIMM, Dual Rank, x4 chips (half the system memory slots)
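For context on the memory configuration above, here is a rough sketch of the theoretical peak bandwidth. It assumes four DDR3 channels per G34 socket, every channel populated (16 DIMMs over 16 channels) and the DIMMs actually running at 1333 MT/s; none of this is confirmed in this post, and the function name is only illustrative.

```python
def peak_bandwidth_gb_s(sockets=4, channels_per_socket=4,
                        transfers_per_s=1333e6, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes).

    Defaults are assumptions: 4 sockets, 4 DDR3 channels per socket,
    DDR3-1333 (1333e6 transfers/s), 64-bit (8-byte) channel width.
    """
    return sockets * channels_per_socket * transfers_per_s * bytes_per_transfer / 1e9


if __name__ == "__main__":
    print(f"Assumed peak: {peak_bandwidth_gb_s():.1f} GB/s")
```

Under those assumptions the peak is about 170.6 GB/s, so a sustained STREAM rate can be expressed as a fraction of it.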
CentOS 5.5
Fortran with auto-parallelization first, then C with OpenMP.
openf95 --version
Open64 Compiler Suite: Version 4.2.4
Built on: 2010-06-28 16:07:57 -0700
Thread model: posix
GNU gcc version 4.2.0 (Open64 4.2.4 driver)

openf95 -Ofast -mso -apo -march=auto basic_stream.f -DUNDERSCORE second_wall.c -o open_auto_stream -static

time ./open_auto_stream
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size = 64000000
 Offset = 0
 The total memory requirement is 1464 MB
 You are running each test 20 times
 --
 The *best* time for each test is used
 *EXCLUDING* the first and last iterations
 ----------------------------------------------------
 Your clock granularity/precision appears to be 1 microseconds
 ----------------------------------------------------
Function     Rate (MB/s)   Avg time   Min time   Max time
Copy:           97125.06     0.0107     0.0105     0.0108
Scale:          95916.91     0.0108     0.0107     0.0109
Add:            93991.38     0.0166     0.0163     0.0168
Triad:         105047.38     0.0148     0.0146     0.0150
 ----------------------------------------------------
 Solution Validates!
 ----------------------------------------------------
 
real 0m1.757s
user 0m33.986s
sys 0m14.120s
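
As a cross-check on the table above, the rates follow from the array size and the minimum time: STREAM counts two arrays moved per element for Copy and Scale and three for Add and Triad, at 8 bytes per word, with 1 MB = 10^6 bytes. A minimal sketch (the names are mine, not part of STREAM); since the printed times are rounded to four decimals, the results only approximate the printed rates.

```python
N = 64_000_000          # array size reported by the benchmark
BYTES_PER_WORD = 8      # DOUBLE PRECISION

# Number of arrays read or written per element for each kernel.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def stream_rate_mb_s(kernel, min_time_s, n=N):
    """Rate in MB/s (10**6 bytes/s) from the best (minimum) time."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_WORD * n
    return bytes_moved / min_time_s / 1e6


def total_memory_mib(n=N):
    """Memory for the three arrays, in units of 2**20 bytes."""
    return 3 * BYTES_PER_WORD * n / 2**20


if __name__ == "__main__":
    print(f"Triad: {stream_rate_mb_s('Triad', 0.0146):.0f} MB/s")
    print(f"Memory: {total_memory_mib():.1f} MB")
```

The memory figure comes out at 1464.8, matching the "total memory requirement" line, which shows that line uses 2^20-byte units while the rates use 10^6.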

opencc --version
Open64 Compiler Suite: Version 4.2.4
Built on: 2010-06-28 16:07:57 -0700
Thread model: posix
GNU gcc version 4.2.0 (Open64 4.2.4 driver)

opencc -Ofast -mso -mp -march=auto stream.c second_wall.c -o open_openmp_stream -static
time ./open_openmp_stream
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 64000000, Offset = 0
Total memory required = 1464.8 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 32
-------------------------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 93292 microseconds.
   (= 93292 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function     Rate (MB/s)   Avg time   Min time   Max time
Copy:         96860.0265     0.0107     0.0106     0.0109
Scale:        98321.2530     0.0107     0.0104     0.0109
Add:         103448.3187     0.0150     0.0148     0.0152
Triad:       102488.8792     0.0151     0.0150     0.0152
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
real 0m3.213s
user 0m37.478s
sys 0m14.206s
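
One way to read the `time` figures is as the average number of busy cores, (user + sys) / real. This is only a rough sketch with helper names of my own; the totals cover the whole process, including startup and array initialization, not just the timed kernels.

```python
def parse_time(field):
    """Convert a `time` field such as '0m33.986s' to seconds."""
    minutes, seconds = field.rstrip("s").split("m")
    return 60 * float(minutes) + float(seconds)


def avg_busy_cores(real, user, sys_):
    """Average number of cores kept busy over the whole run."""
    return (parse_time(user) + parse_time(sys_)) / parse_time(real)


if __name__ == "__main__":
    # Fortran auto-parallelized run, then C OpenMP run, from this post.
    print(f"{avg_busy_cores('0m1.757s', '0m33.986s', '0m14.120s'):.1f}")
    print(f"{avg_busy_cores('0m3.213s', '0m37.478s', '0m14.206s'):.1f}")
```

By this measure the Fortran run averages about 27 busy cores and the C run about 16, although both reach similar peak rates in the timed loops.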

Cheers, DRHO


Daniel R.H. O'Connell, Ph.D.
Principal Geophysicist
Fugro William Lettis & Associates, Inc.
________________________________
Received on Wed Aug 04 14:03:44 2010