STREAM results on Stampede large memory node

From: John McCalpin <mccalpin@tacc.utexas.edu>
Date: Thu Jan 03 2013 - 13:54:12 CST

System: Dell PowerEdge 820 --- one of the "large memory" nodes in the TACC Stampede system (c400-101 in the current configuration)
Processors: 4 Intel Xeon E5-4650 (2.70 GHz)
Memory: 1024 GB DDR3/1333 (32 DIMMs of 32 GB each)
O/S: RHEL6.3 (2.6.32-279.el6.x86_64)
Compiler: Intel icc (ICC) 13.0.1 20121010
Compile Flags: -xAVX -O3 -ffreestanding -openmp -mcmodel=medium -DVERBOSE -DSTREAM_TYPE=double -DSTREAM_ARRAY_SIZE=30000000000
Runtime environment: KMP_AFFINITY=compact, OMP_NUM_THREADS=32
Execution: numactl -l ./stream.snb_O3_freestanding_double.30000M
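
As a quick sanity check on the configuration above, the memory footprint implied by -DSTREAM_ARRAY_SIZE can be worked out by hand. The sketch below is only illustrative arithmetic (the function name footprint is made up here). It assumes STREAM allocates three arrays of equal length and uses 1 MiB = 2**20 bytes, which matches the figures printed in the sample output further down. Arrays of this size are far beyond 2 GiB, which is presumably why -mcmodel=medium appears in the compile flags.

```python
# Memory footprint of the STREAM arrays for the build described above.
STREAM_ARRAY_SIZE = 30_000_000_000   # elements per array, from -DSTREAM_ARRAY_SIZE
NUM_ARRAYS = 3                       # assumption: the three STREAM arrays a, b, c

def footprint(bytes_per_element):
    """Return (MiB per array, total MiB) for the given element size."""
    per_array_mib = STREAM_ARRAY_SIZE * bytes_per_element / 2**20
    total_mib = NUM_ARRAYS * per_array_mib
    return per_array_mib, total_mib

# 8 bytes for -DSTREAM_TYPE=double, 4 bytes for -DSTREAM_TYPE=float
for label, size in (("double", 8), ("float", 4)):
    per_array, total = footprint(size)
    print(f"{label}: {per_array:.1f} MiB per array, "
          f"{total:.1f} MiB total (= {total / 1024:.1f} GiB)")
```

For the double-precision build this gives the same per-array and total sizes that STREAM itself reports, comfortably within the node's 1024 GB.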

Comments:
1. This used the new version of stream.c (revision 5.10)
2. Results were nearly identical for all suitably large array sizes (100 million to 30 billion elements per array)
3. The compiler flag -ffreestanding prevents the compiler from replacing the STREAM Copy kernel with a call to a library routine.
4. On the Xeon E5-4650, the use of streaming stores does not change STREAM performance; results were identical when compiled with -opt-streaming-stores never
5. Reported bandwidth was essentially identical when compiled for 32-bit arrays with -DSTREAM_TYPE=float

Sample output:
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 30000000000 (elements), Offset = 0 (elements)
Memory per array = 228881.8 MiB (= 223.5 GiB).
Total memory required = 686645.5 MiB (= 670.6 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 32
Number of Threads counted = 32
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 3988409 microseconds.
   (= 3988409 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s    Avg time    Min time    Max time
Copy:           75539.6505      6.3555      6.3543      6.3566
Scale:          75749.1105      6.3383      6.3367      6.3405
Add:            83372.5882      8.6375      8.6359      8.6389
Triad:          83381.4416      8.6372      8.6350      8.6384
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
Results Validation Verbose Results:
    Expected a(1), b(1), c(1): 1153300781250.000000 230660156250.000000 307546875000.000000
    Observed a(1), b(1), c(1): 1153300781250.000000 230660156250.000000 307546875000.000000
    Rel Errors on a, b, c: 0.000000e+00 0.000000e+00 0.000000e+00
-------------------------------------------------------------
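
The "Best Rate" column can be recomputed from the "Min time" column. The sketch below assumes the usual STREAM byte-counting convention: Copy and Scale each touch two arrays per iteration, Add and Triad touch three, and 1 MB = 10**6 bytes. The dictionary and function names are made up for illustration. Because the times are printed to only four decimal places, the recomputed rates agree with the reported ones to a few parts per million rather than exactly.

```python
# Recompute STREAM "Best Rate MB/s" from the minimum kernel times reported above.
N = 30_000_000_000        # elements per array
BYTES_PER_ELEMENT = 8     # double precision

# Assumed STREAM convention: arrays read plus arrays written per kernel.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# Values copied from the sample output.
MIN_TIME = {"Copy": 6.3543, "Scale": 6.3367, "Add": 8.6359, "Triad": 8.6350}
REPORTED = {"Copy": 75539.6505, "Scale": 75749.1105,
            "Add": 83372.5882, "Triad": 83381.4416}

def best_rate_mb_s(kernel):
    """Bandwidth in MB/s (1 MB = 1e6 bytes) from the minimum time in seconds."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_ELEMENT * N
    return 1.0e-6 * bytes_moved / MIN_TIME[kernel]

for kernel in ("Copy", "Scale", "Add", "Triad"):
    rate = best_rate_mb_s(kernel)
    rel_diff = abs(rate - REPORTED[kernel]) / REPORTED[kernel]
    print(f"{kernel:6s} recomputed {rate:12.1f} MB/s, "
          f"reported {REPORTED[kernel]:12.4f} MB/s, rel. diff {rel_diff:.1e}")
```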


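The expected values in the validation lines can be reproduced without running the benchmark. The sketch below replays the four kernels on scalars, assuming the initial values (a=1, b=2, c=0), the one-time doubling of a during timer calibration, and scalar=3, as I understand the validation logic in stream.c. The function name expected_values is hypothetical. The iteration count of 10 is taken from the output above.

```python
# Replay the STREAM kernels on single values to get the expected validation results.
NTIMES = 10     # "Each kernel will be executed 10 times."
SCALAR = 3.0    # assumption: the scalar used by the Scale and Triad kernels

def expected_values(ntimes=NTIMES, scalar=SCALAR):
    a, b, c = 1.0, 2.0, 0.0   # assumed initial array contents
    a = 2.0 * a               # assumed: timer-calibration pass doubles a once
    for _ in range(ntimes):
        c = a                 # Copy
        b = scalar * c        # Scale
        c = a + b             # Add
        a = b + scalar * c    # Triad
    return a, b, c

a, b, c = expected_values()
print(f"Expected a(1), b(1), c(1): {a:f} {b:f} {c:f}")
```

Each pass multiplies a by 15, so under these assumptions a = 2 * 15**10, b = 6 * 15**9 and c = 8 * 15**9, all of which are exactly representable in double precision.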

Received on Fri Jan 04 15:30:44 2013