STREAM results on Haswell EP Xeon E5-2660 v3

From: John McCalpin <mccalpin_at_tacc.utexas.edu>
Date: Tue, 13 Oct 2015 21:02:30 +0000

Dell R630 with 2 Xeon E5-2660 v3 (10 core, 2.6 GHz, 105W) & 64 GiB of DDR4/2133 (one dual-rank 16 GiB DIMM per channel).

Compiled with icc 2015: -O3 -xCORE-AVX2 -ffreestanding -openmp -DSTREAM_ARRAY_SIZE=400000000
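
As a quick sanity check on the array size, the memory footprint implied by -DSTREAM_ARRAY_SIZE=400000000 can be worked out by hand. The sketch below assumes the standard STREAM layout of three arrays of 8-byte elements, which agrees with the "8 bytes per array element" and "Total memory required" lines in the logs under Details. The variable names are illustrative only.

```python
# Footprint implied by the compile-time array size.
# Assumption: three arrays (a, b, c) of 8-byte elements, as in standard STREAM.
ARRAY_SIZE = 400_000_000      # from -DSTREAM_ARRAY_SIZE on the compile line
BYTES_PER_ELEMENT = 8         # reported by STREAM in the logs
NUM_ARRAYS = 3                # a, b, c

total_bytes = NUM_ARRAYS * ARRAY_SIZE * BYTES_PER_ELEMENT
total_mib = total_bytes / 2**20

print(f"{total_bytes / 1e9:.1f} GB = {total_mib:.1f} MiB")
```

This reproduces the 9155.3 MiB that STREAM prints for each run.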

Summary:
Copy 109124 MB/s
Scale 109634 MB/s
Add 111862 MB/s
Triad 111760 MB/s
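
For context, these rates can be compared with the nominal peak DRAM bandwidth of the system. The sketch below is a rough estimate, not a measurement: it assumes 4 DDR4 channels per socket, all populated, each moving 8 bytes per transfer at 2133 MT/s. The channel count is my assumption and is not stated in the message.

```python
# Nominal peak DRAM bandwidth versus the measured Triad rate.
# Assumptions (not from the message): 4 DDR4 channels per socket,
# all populated, 8 bytes per transfer.
SOCKETS = 2
CHANNELS_PER_SOCKET = 4       # assumed
TRANSFERS_PER_S = 2133e6      # DDR4/2133, nominal
BYTES_PER_TRANSFER = 8        # 64-bit data bus per channel

peak_mb_s = SOCKETS * CHANNELS_PER_SOCKET * TRANSFERS_PER_S * BYTES_PER_TRANSFER / 1e6
triad_mb_s = 111760.0         # Triad rate from the Summary
fraction = triad_mb_s / peak_mb_s

print(f"peak {peak_mb_s:.0f} MB/s, Triad reaches {fraction:.1%} of it")
```

Under these assumptions the nominal peak is about 136.5 GB/s, so Triad sustains roughly 82% of it.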

Notes:

  * The Uncore Frequency was set to "Maximum" in the BIOS.
     * C states are enabled.
     * "Energy Efficient Turbo" was disabled.
        * This mode limits the maximum Turbo frequency based on a "return on investment" CPI measurement.
        * For STREAM, "Energy Efficient Turbo" limits core frequencies to 2.6 GHz (vs 2.9 GHz) and reduces performance by slightly less than 1%.
     * Not all combinations of BIOS settings were tested -- I picked settings to maximize memory performance, control, and reproducibility, with much less concern for energy efficiency.
  * Turbo mode was enabled -- the cores ran at 2.9 GHz (max all-core Turbo frequency when using 256-bit registers)
  * Alternate Configurations:
     * Performance was slightly higher (~1%) when using 10-12 cores instead of all 20.
     * Performance was only ~1.7% lower when using 8 cores (4 per socket). This case gave the best energy efficiency.
     * When using all cores, performance was almost completely independent of frequency -- Triad results were only reduced by 1% at the minimum supported frequency of 1.2 GHz.
     * When using fewer than all cores:
        * 8-core (4 per socket) performance was somewhat dependent on frequency -- Triad results dropped by ~15% at 1.2 GHz.
        * 12-core (6 per socket) performance was very weakly dependent on frequency -- Triad results dropped by less than 3% at 1.2 GHz (within 2% of 20-core results at 2.9 GHz).
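
One way to read the frequency notes above is to convert them into the bandwidth each core has to sustain per clock cycle. The sketch below does that arithmetic under my reading of the percentages: the 20-core 1.2 GHz and 8-core 2.9 GHz figures are taken relative to the 20-core 2.9 GHz Triad result, and the ~15% drop for 8 cores at 1.2 GHz is taken relative to the 8-core 2.9 GHz result. The function name is hypothetical and the implied rates are approximations, not measured values.

```python
# Triad bandwidth each core must sustain per clock cycle, from the notes above.
# Assumptions: percentages are interpreted as described in the lead-in;
# rates are in MB/s with 1 MB = 1e6 bytes, as STREAM reports them.
TRIAD_20C_MAX = 111760.0  # MB/s, 20 cores at 2.9 GHz (Summary)

def bytes_per_core_cycle(rate_mb_s, cores, ghz):
    """Average STREAM-counted bytes moved per core per core-clock cycle."""
    return rate_mb_s * 1e6 / (cores * ghz * 1e9)

cases = {
    "20 cores @ 2.9 GHz": (TRIAD_20C_MAX, 20, 2.9),
    "20 cores @ 1.2 GHz": (TRIAD_20C_MAX * 0.99, 20, 1.2),
    " 8 cores @ 2.9 GHz": (TRIAD_20C_MAX * 0.983, 8, 2.9),
    " 8 cores @ 1.2 GHz": (TRIAD_20C_MAX * 0.983 * 0.85, 8, 1.2),
}

per_core = {name: bytes_per_core_cycle(*args) for name, args in cases.items()}
for name, value in per_core.items():
    print(f"{name}: {value:.2f} bytes/cycle/core")
```

Under these readings, the 8-core 1.2 GHz case asks each core for roughly twice as many bytes per cycle as either 20 cores at 1.2 GHz or 8 cores at 2.9 GHz, which is at least consistent with it being the one configuration that shows a clear frequency dependence.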

Details:

~/WorkSpace/SystemMirrors/Discovery2/STREAM/Results/2015-04-22/DDR4_2133:2015-10-13T15:30:46 $ more log.scatter.2601000.20p.AVX2

OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.

OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info

OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}

OMP: Info #156: KMP_AFFINITY: 20 available OS procs

OMP: Info #157: KMP_AFFINITY: Uniform topology

OMP: Info #179: KMP_AFFINITY: 2 packages x 10 cores/pkg x 1 threads/core (20 total cores)

OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:

OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0

OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 1

OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2

OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 3

OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 4

OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 8

OMP: Info #171: KMP_AFFINITY: OS proc 12 maps to package 0 core 9

OMP: Info #171: KMP_AFFINITY: OS proc 14 maps to package 0 core 10

OMP: Info #171: KMP_AFFINITY: OS proc 16 maps to package 0 core 11

OMP: Info #171: KMP_AFFINITY: OS proc 18 maps to package 0 core 12

OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 1 core 0

OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 1 core 1

OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 1 core 2

OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 1 core 3

OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 1 core 4

OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 1 core 8

OMP: Info #171: KMP_AFFINITY: OS proc 13 maps to package 1 core 9

OMP: Info #171: KMP_AFFINITY: OS proc 15 maps to package 1 core 10

OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 1 core 11

OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 1 core 12

OMP: Info #242: KMP_AFFINITY: pid 18881 thread 0 bound to OS proc set {0}

OMP: Info #242: KMP_AFFINITY: pid 18881 thread 1 bound to OS proc set {1}

OMP: Info #242: KMP_AFFINITY: pid 18881 thread 2 bound to OS proc set {2}

OMP: Info #242: KMP_AFFINITY: pid 18881 thread 3 bound to OS proc set {3}

OMP: Info #242: KMP_AFFINITY: pid 18881 thread 4 bound to OS proc set {4}

OMP: Info #242: KMP_AFFINITY: pid 18881 thread 5 bound to OS proc set {5}

OMP: Info #242: KMP_AFFINITY: pid 18881 thread 7 bound to OS proc set {7}

OMP: Info #242: KMP_AFFINITY: pid 18881 thread 6 bound to OS proc set {6}

OMP: Info #242: KMP_AFFINITY: pid 18881 thread 10 bound to OS proc set {10}

OMP: Info #242: KMP_AFFINITY: pid 18881 thread 9 bound to OS proc set {9}

OMP: Info #242: KMP_AFFINITY: pid 18881 thread 8 bound to OS proc set {8}

OMP: Info #242: KMP_AFFINITY: pid 18881 thread 11 bound to OS proc set {11}

OMP: Info #242: KMP_AFFINITY: pid 18881 thread 13 bound to OS proc set {13}

OMP: Info #242: KMP_AFFINITY: pid 18881 thread 12 bound to OS proc set {12}

OMP: Info #242: KMP_AFFINITY: pid 18881 thread 14 bound to OS proc set {14}

OMP: Info #242: KMP_AFFINITY: pid 18881 thread 15 bound to OS proc set {15}

OMP: Info #242: KMP_AFFINITY: pid 18881 thread 16 bound to OS proc set {16}

OMP: Info #242: KMP_AFFINITY: pid 18881 thread 17 bound to OS proc set {17}

OMP: Info #242: KMP_AFFINITY: pid 18881 thread 18 bound to OS proc set {18}

OMP: Info #242: KMP_AFFINITY: pid 18881 thread 19 bound to OS proc set {19}

-------------------------------------------------------------

STREAM version $Revision: 1.4 $

-------------------------------------------------------------

This system uses 8 bytes per array element.

-------------------------------------------------------------

Array size = 400000000, Offset = 0

Total memory required = 9155.3 MiB.

Each test is run 20 times, but only

the *best* time for each is used.

-------------------------------------------------------------

Number of Threads requested = 20

Number of Threads counted = 20

-------------------------------------------------------------

Your clock granularity/precision appears to be 1 microseconds.

Each test below will take on the order of 63379 microseconds.

   (= 63379 clock ticks)

Increase the size of the arrays if this shows that

you are not getting at least 20 clock ticks per test.

-------------------------------------------------------------

WARNING -- The above is only a rough guideline.

For best results, please be sure you know the

precision of your system timer.

-------------------------------------------------------------

Function    Rate (MB/s)    Avg time    Min time    Max time
Copy:       109099.2599    0.0588      0.0587      0.0597
Scale:      109233.7782    0.0587      0.0586      0.0591
Add:        111869.8591    0.0859      0.0858      0.0859
Triad:      111683.9932    0.0860      0.0860      0.0861

-------------------------------------------------------------

Solution Validates: avg error less than 1e-15 on all three arrays

-------------------------------------------------------------


 Performance counter stats for './stream.omp.AVX2':


     128298.046218 task-clock # 18.909 CPUs utilized

               846 context-switches # 0.007 K/sec

                35 cpu-migrations # 0.000 K/sec

             6,369 page-faults # 0.050 K/sec

   371,606,234,043 cycles # 2.896 GHz

   <not supported> stalled-cycles-frontend

   <not supported> stalled-cycles-backend

    77,311,273,853 instructions # 0.21 insns per cycle

    14,561,782,675 branches # 113.500 M/sec

         2,191,532 branch-misses # 0.02% of all branches


       6.784942866 seconds time elapsed
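
The derived values in the right-hand column of the counter output above follow from the raw counts. A small sketch of that arithmetic for this first run, with the numbers copied from the log and variable names of my own choosing:

```python
# Derived perf metrics for the first run, recomputed from the raw counters.
task_clock_ms = 128298.046218     # task-clock (ms, summed over all threads)
elapsed_s = 6.784942866           # wall-clock time
cycles = 371_606_234_043
instructions = 77_311_273_853
branches = 14_561_782_675

task_clock_s = task_clock_ms / 1e3
cpus_utilized = task_clock_s / elapsed_s      # average busy CPUs
avg_ghz = cycles / task_clock_s / 1e9         # average core frequency
ipc = instructions / cycles                   # instructions per cycle
branch_rate_m = branches / task_clock_s / 1e6 # million branches per second

print(f"{cpus_utilized:.3f} CPUs, {avg_ghz:.3f} GHz, "
      f"{ipc:.2f} IPC, {branch_rate_m:.3f} M branches/s")
```

The average frequency of about 2.896 GHz matches the 2.9 GHz all-core Turbo frequency mentioned in the notes.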


OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.

OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info

OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}

OMP: Info #156: KMP_AFFINITY: 20 available OS procs

OMP: Info #157: KMP_AFFINITY: Uniform topology

OMP: Info #179: KMP_AFFINITY: 2 packages x 10 cores/pkg x 1 threads/core (20 total cores)

OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:

OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0

OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 1

OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2

OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 3

OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 4

OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 8

OMP: Info #171: KMP_AFFINITY: OS proc 12 maps to package 0 core 9

OMP: Info #171: KMP_AFFINITY: OS proc 14 maps to package 0 core 10

OMP: Info #171: KMP_AFFINITY: OS proc 16 maps to package 0 core 11

OMP: Info #171: KMP_AFFINITY: OS proc 18 maps to package 0 core 12

OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 1 core 0

OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 1 core 1

OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 1 core 2

OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 1 core 3

OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 1 core 4

OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 1 core 8

OMP: Info #171: KMP_AFFINITY: OS proc 13 maps to package 1 core 9

OMP: Info #171: KMP_AFFINITY: OS proc 15 maps to package 1 core 10

OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 1 core 11

OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 1 core 12

OMP: Info #242: KMP_AFFINITY: pid 18903 thread 0 bound to OS proc set {0}

OMP: Info #242: KMP_AFFINITY: pid 18903 thread 1 bound to OS proc set {1}

OMP: Info #242: KMP_AFFINITY: pid 18903 thread 3 bound to OS proc set {3}

OMP: Info #242: KMP_AFFINITY: pid 18903 thread 2 bound to OS proc set {2}

OMP: Info #242: KMP_AFFINITY: pid 18903 thread 4 bound to OS proc set {4}

OMP: Info #242: KMP_AFFINITY: pid 18903 thread 5 bound to OS proc set {5}

OMP: Info #242: KMP_AFFINITY: pid 18903 thread 7 bound to OS proc set {7}

OMP: Info #242: KMP_AFFINITY: pid 18903 thread 6 bound to OS proc set {6}

OMP: Info #242: KMP_AFFINITY: pid 18903 thread 8 bound to OS proc set {8}

OMP: Info #242: KMP_AFFINITY: pid 18903 thread 9 bound to OS proc set {9}

OMP: Info #242: KMP_AFFINITY: pid 18903 thread 10 bound to OS proc set {10}

OMP: Info #242: KMP_AFFINITY: pid 18903 thread 11 bound to OS proc set {11}

OMP: Info #242: KMP_AFFINITY: pid 18903 thread 12 bound to OS proc set {12}

OMP: Info #242: KMP_AFFINITY: pid 18903 thread 13 bound to OS proc set {13}

OMP: Info #242: KMP_AFFINITY: pid 18903 thread 14 bound to OS proc set {14}

OMP: Info #242: KMP_AFFINITY: pid 18903 thread 15 bound to OS proc set {15}

OMP: Info #242: KMP_AFFINITY: pid 18903 thread 17 bound to OS proc set {17}

OMP: Info #242: KMP_AFFINITY: pid 18903 thread 16 bound to OS proc set {16}

OMP: Info #242: KMP_AFFINITY: pid 18903 thread 18 bound to OS proc set {18}

OMP: Info #242: KMP_AFFINITY: pid 18903 thread 19 bound to OS proc set {19}

-------------------------------------------------------------

STREAM version $Revision: 1.4 $

-------------------------------------------------------------

This system uses 8 bytes per array element.

-------------------------------------------------------------

Array size = 400000000, Offset = 0

Total memory required = 9155.3 MiB.

Each test is run 20 times, but only

the *best* time for each is used.

-------------------------------------------------------------

Number of Threads requested = 20

Number of Threads counted = 20

-------------------------------------------------------------

Your clock granularity/precision appears to be 1 microseconds.

Each test below will take on the order of 63250 microseconds.

   (= 63250 clock ticks)

Increase the size of the arrays if this shows that

you are not getting at least 20 clock ticks per test.

-------------------------------------------------------------

WARNING -- The above is only a rough guideline.

For best results, please be sure you know the

precision of your system timer.

-------------------------------------------------------------

Function    Rate (MB/s)    Avg time    Min time    Max time
Copy:       109123.6528    0.0588      0.0586      0.0589
Scale:      109633.9575    0.0585      0.0584      0.0586
Add:        111862.0894    0.0859      0.0858      0.0860
Triad:      111760.2507    0.0860      0.0859      0.0861

-------------------------------------------------------------

Solution Validates: avg error less than 1e-15 on all three arrays

-------------------------------------------------------------


 Performance counter stats for './stream.omp.AVX2':


     128286.180255 task-clock # 18.873 CPUs utilized

               734 context-switches # 0.006 K/sec

                27 cpu-migrations # 0.000 K/sec

             6,355 page-faults # 0.050 K/sec

   371,867,953,614 cycles # 2.899 GHz

   <not supported> stalled-cycles-frontend

   <not supported> stalled-cycles-backend

    77,561,980,423 instructions # 0.21 insns per cycle

    14,638,892,990 branches # 114.111 M/sec

         2,146,510 branch-misses # 0.01% of all branches


       6.797216182 seconds time elapsed
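
The reported rates for this second run can be reproduced approximately from the array size and the Min time column. The sketch below assumes the standard STREAM byte accounting (Copy and Scale touch two arrays per iteration, Add and Triad touch three, and 1 MB = 1e6 bytes). Because the printed times are rounded to four decimals, the recomputed rates only agree to within about 0.1%.

```python
# Recompute run-2 rates from the STREAM byte counts and the printed Min time.
# Assumption: Copy/Scale touch 2 arrays per iteration, Add/Triad touch 3,
# and 1 MB = 1e6 bytes (standard STREAM accounting).
N = 400_000_000
BYTES_PER_ELEMENT = 8

# kernel: (arrays touched, Min time in s, reported rate in MB/s), from the log
run2 = {
    "Copy":  (2, 0.0586, 109123.6528),
    "Scale": (2, 0.0584, 109633.9575),
    "Add":   (3, 0.0858, 111862.0894),
    "Triad": (3, 0.0859, 111760.2507),
}

recomputed = {}
rel_diffs = {}
for kernel, (arrays, min_time, reported) in run2.items():
    rate = arrays * BYTES_PER_ELEMENT * N / min_time / 1e6
    recomputed[kernel] = rate
    rel_diffs[kernel] = abs(rate - reported) / reported
    print(f"{kernel:6s} recomputed {rate:10.1f} MB/s, "
          f"reported {reported:10.1f} MB/s, diff {rel_diffs[kernel]:.3%}")
```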


OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.

OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info

OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}

OMP: Info #156: KMP_AFFINITY: 20 available OS procs

OMP: Info #157: KMP_AFFINITY: Uniform topology

OMP: Info #179: KMP_AFFINITY: 2 packages x 10 cores/pkg x 1 threads/core (20 total cores)

OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:

OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0

OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 1

OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2

OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 3

OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 4

OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 8

OMP: Info #171: KMP_AFFINITY: OS proc 12 maps to package 0 core 9

OMP: Info #171: KMP_AFFINITY: OS proc 14 maps to package 0 core 10

OMP: Info #171: KMP_AFFINITY: OS proc 16 maps to package 0 core 11

OMP: Info #171: KMP_AFFINITY: OS proc 18 maps to package 0 core 12

OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 1 core 0

OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 1 core 1

OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 1 core 2

OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 1 core 3

OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 1 core 4

OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 1 core 8

OMP: Info #171: KMP_AFFINITY: OS proc 13 maps to package 1 core 9

OMP: Info #171: KMP_AFFINITY: OS proc 15 maps to package 1 core 10

OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 1 core 11

OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 1 core 12

OMP: Info #242: KMP_AFFINITY: pid 18925 thread 0 bound to OS proc set {0}

OMP: Info #242: KMP_AFFINITY: pid 18925 thread 1 bound to OS proc set {1}

OMP: Info #242: KMP_AFFINITY: pid 18925 thread 2 bound to OS proc set {2}

OMP: Info #242: KMP_AFFINITY: pid 18925 thread 3 bound to OS proc set {3}

OMP: Info #242: KMP_AFFINITY: pid 18925 thread 4 bound to OS proc set {4}

OMP: Info #242: KMP_AFFINITY: pid 18925 thread 6 bound to OS proc set {6}

OMP: Info #242: KMP_AFFINITY: pid 18925 thread 5 bound to OS proc set {5}

OMP: Info #242: KMP_AFFINITY: pid 18925 thread 8 bound to OS proc set {8}

OMP: Info #242: KMP_AFFINITY: pid 18925 thread 7 bound to OS proc set {7}

OMP: Info #242: KMP_AFFINITY: pid 18925 thread 9 bound to OS proc set {9}

OMP: Info #242: KMP_AFFINITY: pid 18925 thread 10 bound to OS proc set {10}

OMP: Info #242: KMP_AFFINITY: pid 18925 thread 11 bound to OS proc set {11}

OMP: Info #242: KMP_AFFINITY: pid 18925 thread 12 bound to OS proc set {12}

OMP: Info #242: KMP_AFFINITY: pid 18925 thread 13 bound to OS proc set {13}

OMP: Info #242: KMP_AFFINITY: pid 18925 thread 14 bound to OS proc set {14}

OMP: Info #242: KMP_AFFINITY: pid 18925 thread 15 bound to OS proc set {15}

OMP: Info #242: KMP_AFFINITY: pid 18925 thread 17 bound to OS proc set {17}

OMP: Info #242: KMP_AFFINITY: pid 18925 thread 18 bound to OS proc set {18}

OMP: Info #242: KMP_AFFINITY: pid 18925 thread 16 bound to OS proc set {16}

OMP: Info #242: KMP_AFFINITY: pid 18925 thread 19 bound to OS proc set {19}

-------------------------------------------------------------

STREAM version $Revision: 1.4 $

-------------------------------------------------------------

This system uses 8 bytes per array element.

-------------------------------------------------------------

Array size = 400000000, Offset = 0

Total memory required = 9155.3 MiB.

Each test is run 20 times, but only

the *best* time for each is used.

-------------------------------------------------------------

Number of Threads requested = 20

Number of Threads counted = 20

-------------------------------------------------------------

Your clock granularity/precision appears to be 1 microseconds.

Each test below will take on the order of 63374 microseconds.

   (= 63374 clock ticks)

Increase the size of the arrays if this shows that

you are not getting at least 20 clock ticks per test.

-------------------------------------------------------------

WARNING -- The above is only a rough guideline.

For best results, please be sure you know the

precision of your system timer.

-------------------------------------------------------------

Function    Rate (MB/s)    Avg time    Min time    Max time
Copy:       109216.8898    0.0587      0.0586      0.0588
Scale:      109468.9808    0.0585      0.0585      0.0586
Add:        112044.8076    0.0857      0.0857      0.0859
Triad:      111813.6308    0.0859      0.0859      0.0860

-------------------------------------------------------------

Solution Validates: avg error less than 1e-15 on all three arrays

-------------------------------------------------------------


 Performance counter stats for './stream.omp.AVX2':


     128260.580114 task-clock # 18.900 CPUs utilized

               720 context-switches # 0.006 K/sec

                28 cpu-migrations # 0.000 K/sec

             6,319 page-faults # 0.049 K/sec

   371,839,666,252 cycles # 2.899 GHz

   <not supported> stalled-cycles-frontend

   <not supported> stalled-cycles-backend

    77,729,679,802 instructions # 0.21 insns per cycle

    14,683,151,696 branches # 114.479 M/sec

         2,179,317 branch-misses # 0.01% of all branches


       6.786395421 seconds time elapsed
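
To gauge reproducibility across the three runs logged above, the run-to-run spread of each kernel can be computed directly, along with a check of which run the Summary figures correspond to. This is a sketch over the numbers copied from the logs; the spread definition, (max - min) / mean, is my choice.

```python
# Run-to-run spread of the reported rates (MB/s) across the three logged runs.
runs = {
    "Copy":  [109099.2599, 109123.6528, 109216.8898],
    "Scale": [109233.7782, 109633.9575, 109468.9808],
    "Add":   [111869.8591, 111862.0894, 112044.8076],
    "Triad": [111683.9932, 111760.2507, 111813.6308],
}
summary = {"Copy": 109124, "Scale": 109634, "Add": 111862, "Triad": 111760}

# Spread defined here as (max - min) / mean for each kernel.
spread = {k: (max(v) - min(v)) / (sum(v) / len(v)) for k, v in runs.items()}
for kernel, value in spread.items():
    print(f"{kernel:6s} spread {value:.2%}")

# The Summary values equal the second run's rates rounded to whole MB/s.
matches_run2 = all(round(v[1]) == summary[k] for k, v in runs.items())
print("Summary matches run 2:", matches_run2)
```

All four kernels vary by well under 0.5% between runs, and the Summary figures are the rounded rates from the second run.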



--
John D. McCalpin, Ph.D.
Texas Advanced Computing Center
University of Texas at Austin
https://www.tacc.utexas.edu/about/directory/john-mccalpin
Received on Thu Oct 15 2015 - 15:03:29 CDT
