STREAM results on Stampede Xeon Phi SE10P coprocessors

From: John McCalpin
Date: Thu Jan 03 2013 - 17:43:33 CST

System: Intel Xeon Phi SE10P coprocessor in Stampede compute node (c559-003=
Processor: Intel Xeon Phi SE10P coprocessor (Revision B1, 61 cores, 1.1 GHz=
Memory: 8 GiB GDDR5
O/S: Linux (
Compiler: Intel icc (ICC) 13.0.1 20121010
Compile Flags: -mmic -ffreestanding -DUSE_TSC -O3 -openmp -mP2OPT_hlo_use_const_pref_dist=64 -mP2OPT_hlo_use_const_second_pref_dist=32 -mcmodel=medium -mGLOB_default_function_attrs="knc_stream_store_controls=2" -DVE=
Runtime environment: KMP_AFFINITY=scatter, OMP_NUM_THREADS=61
Execution: ./stream.knc_freestanding_double.200M

1. This used the new version of stream.c (revision 5.10)
2. Results were very similar for all suitably large array sizes (100 million to 400 million elements per array), with reduced performance at smaller sizes (e.g., -4% at 20 million elements per array).
    The smallest array sizes used (16 million elements per array) resulted in kernel execution times of about 2 milliseconds, so these may suffer from OpenMP barrier overhead.
3. The compiler flag -ffreestanding prevents the compiler from replacing the STREAM Copy kernel with a call to a library routine.
4. Reported bandwidth was essentially identical when compiled for 32-bit arrays with -DSTREAM_TYPE=float.
5. On the Xeon Phi processors, software prefetching (inserted by the compiler) is important to obtain high bandwidth. Reducing the optimization flags to "-O3" results in about a 15%-20% decrease in sustained memory bandwidth for most array sizes (relative to the best results available).
6. The best results with the compiler flags above were obtained with one thread per core. Using 2, 3, or 4 threads per core decreased the Triad bandwidth by approximately 4%, 8%, and 12%, respectively. This is likely due to the ratio of memory data streams to DRAM banks, but analysis will have to be deferred until Intel provides better hardware performance counters.

Sample output:
STREAM version $Revision: 1.5 $
This system uses 8 bytes per array element.
Array size = 200000000 (elements), Offset = 0 (elements)
Memory per array = 1525.9 MiB (= 1.5 GiB).
Total memory required = 4577.6 MiB (= 4.5 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
Number of Threads requested = 61
Number of Threads counted = 61
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 20194 microseconds.
   (= 20194 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          150475.0527     0.0213       0.0213       0.0215
Scale:         150375.5846     0.0214       0.0213       0.0217
Add:           160470.7413     0.0300       0.0299       0.0304
Triad:         161828.9903     0.0298       0.0297       0.0303
Solution Validates: avg error less than 1.000000e-13 on all three arrays
Results Validation Verbose Results:
    Expected a(1), b(1), c(1): 1153300781250.000000 230660156250.000000 307=
    Observed a(1), b(1), c(1): 1153300781250.000000 230660156250.000000 307=
    Rel Errors on a, b, c: 0.000000e+00 0.000000e+00 0.000000e+00
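
As a cross-check on the sample output above, the reported rates and memory sizes can be recomputed from the array size and the minimum times. The sketch below assumes STREAM's usual accounting (Copy and Scale touch two arrays per iteration, Add and Triad touch three, and 1 MB = 10^6 bytes). The function names are hypothetical and are not taken from stream.c.

```c
#include <stddef.h>

/* Bandwidth in MB/s (1 MB = 1e6 bytes) for a kernel that touches
 * `arrays` arrays of `n` elements, each `elem_bytes` wide, in
 * `seconds` seconds. */
double stream_mb_per_s(int arrays, size_t elem_bytes, size_t n,
                       double seconds)
{
    return 1.0e-6 * (double)arrays * (double)elem_bytes * (double)n
           / seconds;
}

/* Memory used by one array, in MiB (1 MiB = 2^20 bytes). */
double array_mib(size_t elem_bytes, size_t n)
{
    return (double)elem_bytes * (double)n / 1048576.0;
}
```

For this run, array_mib(8, 200000000) is about 1525.9 MiB, matching the "Memory per array" line, and three such arrays give the 4577.6 MiB total. stream_mb_per_s(3, 8, 200000000, 0.0297) is about 161616 MB/s and stream_mb_per_s(2, 8, 200000000, 0.0213) about 150235 MB/s. The small differences from the reported Triad and Copy rates come from the minimum times being printed to only four decimal places.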

Received on Fri Jan 04 15:30:47 2013
