STREAM update for Intel Xeon Phi SE10P

From: John McCalpin <mccalpin_at_tacc.utexas.edu>
Date: Tue Sep 17 2013 - 13:31:03 CDT

New STREAM results for the Intel Xeon Phi SE10P
Note: since the initial Xeon Phi SE10P submission on 2013-01-03, the operating
system software on the Xeon Phi SE10P coprocessors of the Stampede system
was updated to include "transparent huge page" support.
These performance numbers now match the values previously seen with
explicit use of large pages via the (inconvenient) "hugetlbfs" mechanism.

Following the instructions on:
http://software.intel.com/en-us/articles/optimizing-memory-bandwidth-on-stream-triad

Source:
stream_5-10.c (standard)

Compiler version:
module: intel/13.1.1.163.
$ icc --version
icc (ICC) 13.1.1 20130313

Compile options:
$ icc -mmic -O3 -openmp -DNTIMES=100 -DSTREAM_ARRAY_SIZE=64000000 \
    -opt-prefetch-distance=64,8 -opt-streaming-cache-evict=0 \
    -opt-streaming-stores always stream_5-10.c -o stream_intelopt.100x.mic

Run on Stampede node c408-803

This run had the most "balanced" results. Of the 11 runs I tried, this one
was the fastest for Scale and Add, third fastest for Copy, and fourth fastest
for Triad. It is therefore on the high end of the results, but all 11 runs
were very close. For example, the Add result was just under 0.9% faster than
the median Add result of the 11 runs, while the Triad result was about 0.4%
slower than the fastest Triad result of the 11 runs.

~/Stampede/STREAM $ ./stream_intelopt.100x.mic
-------------------------------------------------------------
STREAM version $Revision: 1.5 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 64000000 (elements), Offset = 0 (elements)
Memory per array = 488.3 MiB (= 0.5 GiB).
Total memory required = 1464.8 MiB (= 1.4 GiB).
Each kernel will be executed 100 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 60
Number of Threads counted = 60
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 9571 microseconds.
   (= 9571 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time  Min time  Max time
Copy:          169446.7707    0.0062    0.0060    0.0063
Scale:         169173.1249    0.0062    0.0061    0.0063
Add:           174824.3180    0.0090    0.0088    0.0091
Triad:         174663.1678    0.0089    0.0088    0.0091
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------




--
John D. McCalpin, Ph.D.
Texas Advanced Computing Center
University of Texas at Austin
http://www.tacc.utexas.edu/~mccalpin/


Received on Tue Sep 17 17:08:11 2013
