STREAM Results from TACC Lonestar 4 system Xeon X5680 (Westmere EP)

From: John McCalpin <mccalpin_at_tacc.utexas.edu>
Date: Tue, 2 Jun 2015 20:48:02 +0000

Older results from one of the compute nodes in the TACC Lonestar 4 system.
NOTE: These processors were released in March 2010, the TACC Lonestar 4 system opened for production use in January 2012, and these tests were run in September 2012.

The nodes are Dell PowerEdge M610 servers with two 6-core Xeon X5680 (Westmere EP) processors running at 3.33 GHz, with 12 MiB of L3 cache per socket.
Max Turbo frequency is 3.466 GHz using all cores or 3.600 GHz using 1 or 2 cores.
HyperThreading is disabled.
Each compute node is equipped with six 4 GiB DDR3/1333 DIMMs -- one per channel (two sockets with three DRAM channels each) -- for a peak memory bandwidth of 32 GB/s per socket, or 64 GB/s per node.
I don't have the DRAM model number handy, but these are almost certainly dual-rank DIMMs composed of eighteen 2 Gbit (256M x8) parts.

The array size used was 20 million elements (160 MB per array), making each array about 6.3 times the size of the combined 24 MiB of L3 cache, so the problem size is comfortably large enough.
The codes were compiled with the Intel 11.1 compiler, using "-O2 -ffreestanding -openmp", with the "-opt-streaming-stores never" flag added for the cases where I did not want streaming stores. The tests were all run with "-O3" as well as "-O2", but the results did not differ beyond the natural run-to-run variability.
Similarly, code compiled with OpenMP and run with a single thread performed the same (within run-to-run variability) as code compiled without OpenMP support.

Full results are included for 1 thread and 12 threads with "compact" placement, "-O3" optimization, and for compilation with and without streaming stores.
Summary results are included for all the cases run. These may be useful, for example, for examining performance scaling from 1 to 6 cores per socket, or for seeing the detailed differences between the streaming-store and non-streaming-store cases as a function of thread count.

Summary of results using all 12 cores (best rates, in MB/s):

Kernel    w/ streaming stores    w/o streaming stores
Copy                  40165.7                 26893.0
Scale                 40206.6                 26664.4
Add                   40955.8                 29446.2
Triad                 41687.7                 29491.5

john

--
John D. McCalpin, Ph.D.
Texas Advanced Computing Center
University of Texas at Austin
https://www.tacc.utexas.edu/about/directory/john-mccalpin



Received on Wed Jun 03 2015 - 07:45:50 CDT

This archive was generated by hypermail 2.3.0 : Tue Jul 28 2015 - 13:12:07 CDT