STREAM results on Stampede compute node

From: John McCalpin <mccalpin@tacc.utexas.edu>
Date: Thu Jan 03 2013 - 14:41:51 CST

System: Dell DCS8000 --- one of the compute nodes in the TACC Stampede system (c559-003 in the current configuration)
Processors: 2 Intel Xeon E5-2680 (2.70 GHz)
Memory: 32 GiB DDR3/1600 (8 DIMMs of 4 GiB each)
O/S: RHEL6.3 (2.6.32-279.el6.x86_64)
Compiler: Intel icc (ICC) 13.0.1 20121010
Compile Flags: -xAVX -O3 -ffreestanding -openmp -mcmodel=medium -DVERBOSE -DSTREAM_TYPE=double -DSTREAM_ARRAY_SIZE=20000000
Runtime environment: KMP_AFFINITY=compact, OMP_NUM_THREADS=16
Execution: numactl -l ./stream.snb_O3_freestanding_double.20M

Comments:
1. This used the new version of stream.c (revision 5.10)
2. Results were very similar for all suitably large array sizes (20 million to 1 billion elements per array), with a slight reduction (about 2%) in the Triad results for the larger array sizes.
    Note that the array size of 20M used below is the minimum allowed for this system, making each array ~4 times the size of the total L3 cache (2x20 MB).
    The factor of ~4 ignores the difference between 10^6 and 2^20; accounting for it, the minimum array size is 4*(10^6)/(2^20) = 3.8 times the size of the total L3 cache.
3. The compiler flag -ffreestanding prevents the compiler from replacing the STREAM Copy kernel with a call to a library routine; a sketch of the affected code pattern appears after this list.
4. Reported bandwidth was essentially identical when compiled for 32-bit arrays with -DSTREAM_TYPE=float
5. On the Xeon E5-2680, the use of streaming stores significantly improves STREAM performance. When these were disabled (by compiling with -opt-streaming-stores never), performance decreased by ~1/3 for the Copy and Scale kernels and by ~1/4 for the Add and Triad kernels. A sketch of the difference appears after this list.
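
A minimal sketch of the pattern -ffreestanding protects (comment 3). This is illustrative only, not the benchmark source; the array names mirror STREAM's globals. The Copy kernel is an element-by-element loop that an optimizing compiler is otherwise allowed to recognize and replace with a library memcpy() call; -ffreestanding tells the compiler no hosted C library is available, so the timed code is the compiler-generated loop.

    #include <stddef.h>

    #define STREAM_ARRAY_SIZE 20000000

    static double a[STREAM_ARRAY_SIZE], c[STREAM_ARRAY_SIZE];

    /* STREAM Copy pattern: without -ffreestanding, the compiler may lower
     * this loop to memcpy(c, a, ...), so the measurement would time the
     * library routine rather than compiler-generated loads and stores. */
    void stream_copy(void)
    {
        for (size_t j = 0; j < STREAM_ARRAY_SIZE; j++)
            c[j] = a[j];
    }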
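
And a sketch of what comment 5 refers to (again an illustration under my own naming, not the STREAM source or icc's generated code): a streaming (non-temporal) store writes a full cache line to memory without first reading it, so the kernel avoids the read-for-ownership traffic on the destination array that an ordinary store incurs.

    #include <immintrin.h>

    /* Scale with regular stores: each destination line is read into cache
     * (read-for-ownership) before it is overwritten. */
    void scale_regular(double *restrict c, const double *restrict a,
                       double scalar, long n)
    {
        for (long j = 0; j < n; j++)
            c[j] = scalar * a[j];
    }

    /* Scale with explicit streaming stores (assumes c is 32-byte aligned
     * and n is a multiple of 4); destination lines go straight to memory. */
    void scale_streaming(double *restrict c, const double *restrict a,
                         double scalar, long n)
    {
        __m256d s = _mm256_set1_pd(scalar);
        for (long j = 0; j < n; j += 4)
            _mm256_stream_pd(&c[j], _mm256_mul_pd(s, _mm256_loadu_pd(&a[j])));
        _mm_sfence();  /* order the non-temporal stores before later accesses */
    }

With icc the same effect comes from letting the compiler generate the non-temporal stores itself (the default for these kernels) versus suppressing them with -opt-streaming-stores never.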

Sample output:
-------------------------------------------------------------
STREAM version $Revision: 1.5 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 20000000 (elements), Offset = 0 (elements)
Memory per array = 152.6 MiB (= 0.1 GiB).
Total memory required = 457.8 MiB (= 0.4 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 16
Number of Threads counted = 16
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 3452 microseconds.
   (= 3452 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Best Rate MB/s   Avg time   Min time   Max time
Copy:             75688.1114     0.0043     0.0042     0.0043
Scale:            76316.4428     0.0042     0.0042     0.0042
Add:              76579.1525     0.0063     0.0063     0.0063
Triad:            77009.7510     0.0063     0.0062     0.0063
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
Results Validation Verbose Results:
    Expected a(1), b(1), c(1): 1153300781250.000000 230660156250.000000 307546875000.000000
    Observed a(1), b(1), c(1): 1153300781250.000000 230660156250.000000 307546875000.000000
    Rel Errors on a, b, c: 0.000000e+00 0.000000e+00 0.000000e+00
-------------------------------------------------------------
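
(Not part of the program output.) As a consistency check on the table above, the Copy rate can be reproduced from the bytes moved per iteration: 2 arrays x 20,000,000 elements x 8 bytes = 320 MB (decimal), and 320e6 bytes / 75688.1114e6 bytes/s = 0.00423 s, matching the reported minimum time of 0.0042 s. Add and Triad move three arrays (480 MB), which is why their times are about 1.5x longer at nearly the same bandwidth.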


