tuned STREAM results (Fujitsu SPARC M12-2, 12 cores)

From: 木村 茂 <kimura.shigeru_at_jp.fujitsu.com>
Date: Fri, 31 Mar 2017 13:48:21 +0900

Dear Dr. McCalpin,

  We have measured the STREAM benchmark ("tuned") on the Fujitsu SPARC M12-2.
  Please publish this score on the STREAM Web site on April 4, 2017 or later.

           System Name: Fujitsu SPARC M12-2
              CPU Name: SPARC64 XII
               CPU MHz: 3900
        CPU(s) enabled: 12 cores, 1 chip, 12 cores/chip, 8 threads/core
         Primary Cache: 64 KB I + 64 KB D on chip per core
       Secondary Cache: 512 KB I+D on chip per core
              L3 Cache: 32 MB I+D on chip per chip
           Other Cache: None
                Memory: 512 GB (16 x 32 GB 2Rx4 PC4-2400T-R, ECC)
      Operating System: Oracle Solaris 11.3 (an upcoming SRU)
              Compiler: Version 12.6 of Oracle Developer Studio
     Compilation Flags: -fast -m64 -xopenmp -xtarget=sparc64xplus -xipo=2
                        -xpagesize=4M -xlinkopt -xvector -xprefetch_level=3
                        -xprefetch=latx:8.0
    STREAM Source Code: The following tuning is applied to the Fortran version (v5.6)
                Tuning: xfill (stxa with ASI_XFILL_P(0xf2)) instructions are
                        used to reduce memory read transactions.
           OS Settings: (/etc/system parameters)
                        autoup=86400 doiflush=0 dopageflush=0
                        zfs:zfs_arc_max=1073741824
     Shell Environment: OMP_NUM_THREADS=24
                        SUNW_MP_PROCBIND="1 5 9 13 17 21 25 29 33 37 41 45
                                          49 53 57 61 65 69 73 77 81 85 89 93"
                   Run: <stream>
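
  Note on the thread binding: the SUNW_MP_PROCBIND list above is an
  arithmetic sequence (start 1, step 4) with 24 entries, one per OpenMP
  thread. The Python sketch below only regenerates that list from the
  hardware counts given in this report. The function name and its
  parameters are hypothetical, and the comment about two threads per core
  assumes that virtual processor IDs are numbered contiguously within each
  core (IDs 0-7 on the first core, 8-15 on the second, and so on), which
  this report does not state.

```python
def procbind_list(cores=12, threads_per_core=8, first=1, stride=4):
    """Return the virtual processor IDs used for binding.

    cores and threads_per_core come from the "CPU(s) enabled" line;
    first and stride are read off the SUNW_MP_PROCBIND value.
    """
    total_ids = cores * threads_per_core  # 96 hardware threads in total
    return list(range(first, total_ids, stride))


ids = procbind_list()

# Assumption: IDs are contiguous per core, so id // 8 is the core index.
# Under that assumption, a stride of 4 places two OpenMP threads on each core.
threads_on_core = {}
for vp in ids:
    core = vp // 8
    threads_on_core[core] = threads_on_core.get(core, 0) + 1

# Render the list in the same form as the environment variable value.
print('SUNW_MP_PROCBIND="' + " ".join(str(vp) for vp in ids) + '"')
```

  The list length matches OMP_NUM_THREADS=24.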

  Outputs:

----------------------------------------------
  Double precision appears to have 16 digits of accuracy
  Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
  ----------------------------------------------
  STREAM Version $Revision: 5.6 $
  ----------------------------------------------
  Array size = 2000000000
  Offset = 48
  The total memory requirement is 720 MB
  You are running each test 10 times
  --
  The *best* time for each test is used
  *EXCLUDING* the first and last iterations
  ----------------------------------------------
  Number of Threads = 24
  ----------------------------------------------
  Printing one line per active thread....
  Printing one line per active thread....
  Printing one line per active thread....
  Printing one line per active thread....
  Printing one line per active thread....
  Printing one line per active thread....
  Printing one line per active thread....
  Printing one line per active thread....
  Printing one line per active thread....
  Printing one line per active thread....
  Printing one line per active thread....
  Printing one line per active thread....
  Printing one line per active thread....
  Printing one line per active thread....
  Printing one line per active thread....
  Printing one line per active thread....
  Printing one line per active thread....
  Printing one line per active thread....
  Printing one line per active thread....
  Printing one line per active thread....
  Printing one line per active thread....
  Printing one line per active thread....
  Printing one line per active thread....
  Printing one line per active thread....
  ----------------------------------------------------
  Your clock granularity/precision appears to be 1 microseconds
  ----------------------------------------------------
  Function     Rate (MB/s)   Avg time   Min time   Max time
  Copy:        122797.3307     0.2608     0.2606     0.2610
  Scale:       122306.8732     0.2618     0.2616     0.2619
  Add:         127416.6518     0.3779     0.3767     0.3801
  Triad:       127750.2441     0.3775     0.3757     0.3801
  ----------------------------------------------------
  Solution Validates!
  ----------------------------------------------------
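
  Note on reading the numbers: STREAM derives each rate from the best
  (minimum) time, counting two arrays for Copy and Scale and three for Add
  and Triad, with 1 MB = 10^6 bytes. The Python sketch below recomputes the
  rates from the rounded "Min time" column as a consistency check, so small
  differences from the reported values are expected. It also shows one
  possible explanation for the "720 MB" memory line: 3 arrays x 8 bytes x
  2000000000 elements is 48,000,000,000 bytes, and wrapping that product to
  a signed 32-bit integer gives 720 MiB. The overflow explanation is an
  assumption, not something stated in this report, and all names in the
  sketch are hypothetical.

```python
BYTES_PER_WORD = 8              # "Assuming 8 bytes per DOUBLE PRECISION word"
ARRAY_SIZE = 2_000_000_000      # "Array size = 2000000000"

ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}
MIN_TIME = {"Copy": 0.2606, "Scale": 0.2616, "Add": 0.3767, "Triad": 0.3757}
REPORTED = {"Copy": 122797.3307, "Scale": 122306.8732,
            "Add": 127416.6518, "Triad": 127750.2441}


def rate_mb_s(kernel):
    """Bytes moved per iteration divided by the best time, in 10^6 bytes/s."""
    moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_WORD * ARRAY_SIZE
    return moved / MIN_TIME[kernel] / 1e6


def wrap_int32(value):
    """Wrap an integer to the signed 32-bit range (two's complement)."""
    return (value + 2**31) % 2**32 - 2**31


for kernel in ARRAYS_TOUCHED:
    recomputed = rate_mb_s(kernel)
    rel_diff = abs(recomputed - REPORTED[kernel]) / REPORTED[kernel]
    print(f"{kernel:6s} recomputed {recomputed:12.1f} MB/s, "
          f"relative difference {rel_diff:.5%}")

true_footprint = 3 * BYTES_PER_WORD * ARRAY_SIZE    # bytes for the 3 arrays
# Assumption: the printed figure comes from 32-bit integer arithmetic.
wrapped_mib = wrap_int32(true_footprint) // (1024 * 1024)
print("true footprint (bytes):", true_footprint)
print("wrapped footprint (MiB):", wrapped_mib)
```

  If the assumption holds, the arrays occupy about 48 GB in total, which
  fits in the 512 GB of memory listed above.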
Received on Sat Apr 01 2017 - 16:15:34 CDT