STREAM results from HP Moonshot M300 from John D. McCalpin, Ph.D. on 2016-01-26 (STREAM e-mail archives for 2014)

From: John D. McCalpin, Ph.D. <john.mccalpin_at_att.net>
Date: Tue, 26 Jan 2016 11:40:14 -0600

These tests were run on 2014-06-19, but were never submitted.

The STREAM code was compiled with the Intel 14 compiler targeting the
c2750 (Silvermont) Atom processor.
The best results were obtained using 4 cores and KMP_AFFINITY=scatter so
that I used one core of each pair.
Performance degraded slightly when using all 8 cores. I did not attempt
to determine whether this was due to contention for the shared L2 caches
or contention at the DRAMs.

The results included here are the median of three tests run on an idle
system, with KMP_AFFINITY used to prevent thread migration.
Turbo mode was enabled, and the results were about 1% higher than
results with the frequency pinned to the nominal value of 2.4 GHz.
==================================================================
Single-thread results:
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf
11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected:
{0,1,2,3,4,5,6,7}
OMP: Info #156: KMP_AFFINITY: 8 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 8 cores/pkg x 1 threads/core
(8 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 3
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 4
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 5
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 6
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 7
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
  The *best* time for each kernel (excluding the first iteration)
  will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 22813 microseconds.
    (= 22813 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 7056.2919 0.0227 0.0227 0.0229
Scale: 6920.3659 0.0232 0.0231 0.0234
Add: 7054.2892 0.0340 0.0340 0.0341
Triad: 6935.5998 0.0346 0.0346 0.0347
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

==================================================================
Four-thread results:
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf
11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected:
{0,1,2,3,4,5,6,7}
OMP: Info #156: KMP_AFFINITY: 8 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 8 cores/pkg x 1 threads/core
(8 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 3
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 4
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 5
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 6
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 7
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {2}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {4}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {6}
-------------------------------------------------------------
STREAM version $Revision: 1.5 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
  The *best* time for each kernel (excluding the first iteration)
  will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 13098 microseconds.
    (= 13098 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 14891.9013 0.0108 0.0107 0.0110
Scale: 14813.6647 0.0108 0.0108 0.0109
Add: 15313.9665 0.0157 0.0157 0.0159
Triad: 15487.8523 0.0156 0.0155 0.0156
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

==================================================================
Eight-thread results:
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf
11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected:
{0,1,2,3,4,5,6,7}
OMP: Info #156: KMP_AFFINITY: 8 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 8 cores/pkg x 1 threads/core
(8 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 3
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 4
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 5
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 6
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 7
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {2}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {4}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {6}
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {1}
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {3}
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {5}
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {7}
-------------------------------------------------------------
STREAM version $Revision: 1.5 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
  The *best* time for each kernel (excluding the first iteration)
  will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 8
Number of Threads counted = 8
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 12964 microseconds.
    (= 12964 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 13684.5155 0.0118 0.0117 0.0118
Scale: 13699.8804 0.0117 0.0117 0.0117
Add: 14356.6798 0.0168 0.0167 0.0168
Triad: 14430.7724 0.0167 0.0166 0.0167
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

Received on Tue Jan 26 2016 - 11:40:17 CST

This archive was generated by hypermail 2.3.0 : Tue Jan 26 2016 - 12:12:16 CST