tuned STREAM on IBM eServer p5 595 (1900 MHz, 64 cpu)

From: Frank Johnston (fjohn@us.ibm.com)
Date: Tue Nov 02 2004 - 09:57:57 CST

    These are tuned STREAM results on an IBM eServer p5 595, a POWER5 SMP
    machine with sixty-four 1.9 GHz CPUs (36 MB L3 cache).
    Large pages were used in all cases.

    Function    Rate (MB/s)   RMS time   Min time   Max time
    Copy:         158176.44        .03        .03        .03
    Scale:        153812.38        .03        .03        .03
    Add:          169687.38        .04        .04        .04
    Triad:        174567.44        .04        .04        .04
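
The times above are printed to only two decimals, so it can help to recover the pass time implied by each rate. The sketch below assumes the tuned build keeps the standard STREAM byte counting (8-byte words; Copy and Scale touch two arrays per iteration, Add and Triad touch three) and reports MB as 10**6 bytes. The array size is taken from the matching run in the full output ("Array size = 267912192"). The names `implied_seconds` and `per_cpu_rate` are illustrative and are not part of STREAM.

```python
# Words moved per loop iteration under standard STREAM counting
# (assumed unchanged in this tuned build): reads plus writes.
WORD_BYTES = 8
WORDS_PER_ITER = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# Values copied from the posted results.
ARRAY_SIZE = 267_912_192          # "Array size" of the matching run
NUM_CPUS = 64
RATES_MB_S = {
    "Copy": 158176.44,
    "Scale": 153812.38,
    "Add": 169687.38,
    "Triad": 174567.44,
}


def implied_seconds(kernel, rate_mb_s, n=ARRAY_SIZE):
    """Time for one pass implied by a reported rate (1 MB = 10**6 bytes)."""
    bytes_moved = WORDS_PER_ITER[kernel] * WORD_BYTES * n
    return bytes_moved / (rate_mb_s * 1e6)


def per_cpu_rate(rate_mb_s, cpus=NUM_CPUS):
    """Aggregate rate divided evenly over the CPUs."""
    return rate_mb_s / cpus


if __name__ == "__main__":
    for kernel, rate in RATES_MB_S.items():
        ms = implied_seconds(kernel, rate) * 1e3
        print(f"{kernel:6s} {ms:6.2f} ms/pass  {per_cpu_rate(rate):8.1f} MB/s per CPU")
```

Under these assumptions Copy works out to roughly 27 ms per pass and Triad to roughly 37 ms, which is consistent with the .03 and .04 printed in the table.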

    Here is the full output file:
    ----------------------------------------------
     Requesting Large Pages
     Setting up for 8 CPUs per module
     Number of segments per array = 8
     CPU binding list : 0 8 16 24 32 40 48 56
     Shared Segment Pointer = 504403158265495552
     Shared Segment Pointer = 504403160412979200
     Shared Segment Pointer = 504403162560462848
     Segment Size (B) = 268435456 (MB = 256 )
     Array Size (B) = 2147483648 (MB = 2048 )
     Array Size (DW) = 268435456
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     Num_threads = 64
     rebind: num_parthds is 64
    GETSHRSEG: requesting large pages
    GETSHRSEG ENTRY: shmgetflag -2147481216
    bindprocessor successful: thread_self() 2511067 cpu_id 0
    bindprocessor successful: thread_self() 2511067 cpu_id 8
    bindprocessor successful: thread_self() 2511067 cpu_id 16
    bindprocessor successful: thread_self() 2511067 cpu_id 24
    bindprocessor successful: thread_self() 2511067 cpu_id 32
    bindprocessor successful: thread_self() 2511067 cpu_id 40
    bindprocessor successful: thread_self() 2511067 cpu_id 48
    bindprocessor successful: thread_self() 2511067 cpu_id 56
    GETSHRSEG: requesting large pages
    GETSHRSEG ENTRY: shmgetflag -2147481216
    bindprocessor successful: thread_self() 2511067 cpu_id 0
    bindprocessor successful: thread_self() 2511067 cpu_id 8
    bindprocessor successful: thread_self() 2511067 cpu_id 16
    bindprocessor successful: thread_self() 2511067 cpu_id 24
    bindprocessor successful: thread_self() 2511067 cpu_id 32
    bindprocessor successful: thread_self() 2511067 cpu_id 40
    bindprocessor successful: thread_self() 2511067 cpu_id 48
    bindprocessor successful: thread_self() 2511067 cpu_id 56
    GETSHRSEG: requesting large pages
    GETSHRSEG ENTRY: shmgetflag -2147481216
    bindprocessor successful: thread_self() 2511067 cpu_id 0
    bindprocessor successful: thread_self() 2511067 cpu_id 8
    bindprocessor successful: thread_self() 2511067 cpu_id 16
    bindprocessor successful: thread_self() 2511067 cpu_id 24
    bindprocessor successful: thread_self() 2511067 cpu_id 32
    bindprocessor successful: thread_self() 2511067 cpu_id 40
    bindprocessor successful: thread_self() 2511067 cpu_id 48
    bindprocessor successful: thread_self() 2511067 cpu_id 56
    bindprocessor successful: thread_self() 2793567 cpu_id 1
    bindprocessor successful: thread_self() 2502873 cpu_id 14
    bindprocessor successful: thread_self() 2859137 cpu_id 26
    bindprocessor successful: thread_self() 2707509 cpu_id 28
    bindprocessor successful: thread_self() 2805859 cpu_id 16
    bindprocessor successful: thread_self() 2506965 cpu_id 48
    bindprocessor successful: thread_self() 2490573 cpu_id 38
    bindprocessor successful: thread_self() 2809957 cpu_id 3
    bindprocessor successful: thread_self() 2596865 cpu_id 61
    bindprocessor successful: thread_self() 2715853 cpu_id 51
    bindprocessor successful: thread_self() 2818153 cpu_id 4
    bindprocessor successful: thread_self() 2613259 cpu_id 60
    bindprocessor successful: thread_self() 2592769 cpu_id 29
    bindprocessor successful: thread_self() 2474181 cpu_id 40
    bindprocessor successful: thread_self() 2535651 cpu_id 41
    bindprocessor successful: thread_self() 2453591 cpu_id 49
    bindprocessor successful: thread_self() 2543847 cpu_id 39
    bindprocessor successful: thread_self() 2641943 cpu_id 63
    bindprocessor successful: thread_self() 2637845 cpu_id 62
    bindprocessor successful: thread_self() 2785369 cpu_id 17
    bindprocessor successful: thread_self() 2584829 cpu_id 13
    bindprocessor successful: thread_self() 2678977 cpu_id 20
    bindprocessor successful: thread_self() 2564339 cpu_id 21
    bindprocessor successful: thread_self() 2572537 cpu_id 25
    bindprocessor successful: thread_self() 2773085 cpu_id 24
    bindprocessor successful: thread_self() 2740297 cpu_id 35
    bindprocessor successful: thread_self() 2478279 cpu_id 34
    bindprocessor successful: thread_self() 2703413 cpu_id 11
    bindprocessor successful: thread_self() 2486477 cpu_id 50
    bindprocessor successful: thread_self() 2764889 cpu_id 12
    bindprocessor successful: thread_self() 2605063 cpu_id 58
    bindprocessor successful: thread_self() 2629651 cpu_id 59
    bindprocessor successful: thread_self() 2498769 cpu_id 45
    bindprocessor successful: thread_self() 2531553 cpu_id 44
    bindprocessor successful: thread_self() 2867327 cpu_id 30
    bindprocessor successful: thread_self() 2527455 cpu_id 32
    bindprocessor successful: thread_self() 2494671 cpu_id 33
    bindprocessor successful: thread_self() 2560239 cpu_id 54
    bindprocessor successful: thread_self() 2461887 cpu_id 55
    bindprocessor successful: thread_self() 2830447 cpu_id 10
    bindprocessor successful: thread_self() 2822251 cpu_id 5
    bindprocessor successful: thread_self() 2826349 cpu_id 15
    bindprocessor successful: thread_self() 2855033 cpu_id 3 Starting Initialization
     Done With Initialization
     a(1) 1.00000000000000000
     b(M) 1.00000000000000000
     c(M) 1.00000000000000000
     Incremental Offset = 512
     Number of Threads = 64
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 267914240
     Offset = 0
     The total memory requirement is 6132 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 20269 microseconds
        (= 20269 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 158382.48 .03 .03 .03
    Scale: 152918.74 .03 .03 .03
    Add: 169487.13 .04 .04 .04
    Triad: 173851.94 .04 .04 .04
     Sum of a is = 406877816418750.000
     Sum of b is = 81375563283750.0000
     Sum of c is = 108500751045000.000
     Incremental Offset = 1536
     Number of Threads = 64
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 267914240
     Offset = 0
     The total memory requirement is 6132 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 20562 microseconds
        (= 20562 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 158275.12 .03 .03 .03
    Scale: 154048.14 .03 .03 .03
    Add: 170215.60 .04 .04 .04
    Triad: 174275.48 .04 .04 .04
     Sum of a is = 406877816418750.000
     Sum of b is = 81375563283750.0000
     Sum of c is = 108500751045000.000
     Incremental Offset = 2560
     Number of Threads = 64
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 267914240
     Offset = 0
     The total memory requirement is 6132 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 20980 microseconds
        (= 20980 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 157782.03 .03 .03 .03
    Scale: 152296.98 .03 .03 .03
    Add: 168906.49 .04 .04 .04
    Triad: 174002.25 .04 .04 .04
     Sum of a is = 406877816418750.000
     Sum of b is = 81375563283750.0000
     Sum of c is = 108500751045000.000
     Incremental Offset = 512
     Number of Threads = 64
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 267912192
     Offset = 0
     The total memory requirement is 6132 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 20865 microseconds
        (= 20865 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 156829.81 .03 .03 .03
    Scale: 152168.21 .03 .03 .03
    Add: 169285.82 .04 .04 .04
    Triad: 173655.83 .04 .04 .04
     Sum of a is = 406874706018750.000
     Sum of b is = 81374941203750.0000
     Sum of c is = 108499921605000.000
     Incremental Offset = 1536
     Number of Threads = 64
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 267912192
     Offset = 0
     The total memory requirement is 6132 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 20919 microseconds
        (= 20919 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 158176.44 .03 .03 .03
    Scale: 153812.38 .03 .03 .03
    Add: 169687.38 .04 .04 .04
    Triad: 174567.44 .04 .04 .04
     Sum of a is = 406874706018750.000
     Sum of b is = 81374941203750.0000
     Sum of c is = 108499921605000.000
     Incremental Offset = 2560
     Number of Threads = 64
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 267912192
     Offset = 0
     The total memory requirement is 6132 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 20027 microseconds
        (= 20027 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 156806.56 .03 .03 .03
    Scale: 153200.32 .03 .03 .03
    Add: 169332.59 .04 .04 .04
    Triad: 173341.07 .04 .04 .04
     Sum of a is = 406874706018750.000
     Sum of b is = 81374941203750.0000
     Sum of c is = 108499921605000.000
     Incremental Offset = 512
     Number of Threads = 64
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 267910144
     Offset = 0
     The total memory requirement is 6131 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 20166 microseconds
        (= 20166 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 158603.60 .03 .03 .03
    Scale: 152932.01 .03 .03 .03
    Add: 169236.72 .04 .04 .04
    Triad: 173923.28 .04 .04 .04
     Sum of a is = 406871595618750.000
     Sum of b is = 81374319123750.0000
     Sum of c is = 108499092165000.000
     Incremental Offset = 1536
     Number of Threads = 64
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 267910144
     Offset = 0
     The total memory requirement is 6131 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 20604 microseconds
        (= 20604 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 157792.08 .03 .03 .03
    Scale: 153395.21 .03 .03 .03
    Add: 169663.66 .04 .04 .04
    Triad: 173924.40 .04 .04 .04
     Sum of a is = 406871595618750.000
     Sum of b is = 81374319123750.0000
     Sum of c is = 108499092165000.000
     Incremental Offset = 2560
     Number of Threads = 64
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 267910144
     Offset = 0
     The total memory requirement is 6131 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 20299 microseconds
        (= 20299 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 156556.85 .03 .03 .03
    Scale: 153242.24 .03 .03 .03
    Add: 169466.43 .04 .04 .04
    Triad: 173797.75 .04 .04 .04
     Sum of a is = 406871595618750.000
     Sum of b is = 81374319123750.0000
     Sum of c is = 108499092165000.000
     Incremental Offset = 512
     Number of Threads = 64
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 267908096
     Offset = 0
     The total memory requirement is 6131 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 20354 microseconds
        (= 20354 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 157714.75 .03 .03 .03
    Scale: 152767.11 .03 .03 .03
    Add: 169418.29 .04 .04 .04
    Triad: 173775.14 .04 .04 .04
     Sum of a is = 406868485218750.000
     Sum of b is = 81373697043750.0000
     Sum of c is = 108498262725000.000
     Incremental Offset = 1536
     Number of Threads = 64
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 267908096
     Offset = 0
     The total memory requirement is 6131 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 20715 microseconds
        (= 20715 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 158698.98 .03 .03 .03
    Scale: 153576.16 .04 .03 .06
    Add: 169291.73 .04 .04 .04
    Triad: 173827.79 .08 .04 .16
     Sum of a is = 406868485218750.000
     Sum of b is = 81373697043750.0000
     Sum of c is = 108498262725000.000
     Incremental Offset = 2560
     Number of Threads = 64
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 267908096
     Offset = 0
     The total memory requirement is 6131 MB
     You are running each test 5 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 20393 microseconds
        (= 20393 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function Rate (MB/s) RMS time Min time Max time
    Copy: 156882.15 .03 .03 .03
    Scale: 153576.16 .03 .03 .03
    Add: 169742.46 .04 .04 .04
    Triad: 173661.00 .04 .04 .04
     Sum of a is = 406868485218750.000
     Sum of b is = 81373697043750.0000
     Sum of c is = 108498262725000.000
    6
    bindprocessor successful: thread_self() 2470083 cpu_id 37
    bindprocessor successful: thread_self() 2801763 cpu_id 2
    bindprocessor successful: thread_self() 2871425 cpu_id 31
    bindprocessor successful: thread_self() 2723907 cpu_id 18
    bindprocessor successful: thread_self() 2781279 cpu_id 19
    bindprocessor successful: thread_self() 2736197 cpu_id 56
    bindprocessor successful: thread_self() 2777177 cpu_id 57
    bindprocessor successful: thread_self() 2760789 cpu_id 7
    bindprocessor successful: thread_self() 2814057 cpu_id 6
    bindprocessor successful: thread_self() 2687021 cpu_id 9
    bindprocessor successful: thread_self() 2699315 cpu_id 8
    bindprocessor successful: thread_self() 2511067 cpu_id 0
    bindprocessor successful: thread_self() 2539749 cpu_id 47
    bindprocessor successful: thread_self() 2482377 cpu_id 46
    bindprocessor successful: thread_self() 2552043 cpu_id 53
    bindprocessor successful: thread_self() 2465987 cpu_id 52
    bindprocessor successful: thread_self() 2547945 cpu_id 43
    bindprocessor successful: thread_self() 2719807 cpu_id 42
    bindprocessor successful: thread_self() 2556147 cpu_id 23
    bindprocessor successful: thread_self() 2600973 cpu_id 22
    bindprocessor successful: thread_self() 2797663 cpu_id 27
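
The full output above contains twelve result tables (three incremental offsets for each of four array sizes), and the figures quoted at the top of the message match the run with the highest Triad rate. A minimal sketch for pulling the tables out of a saved copy of this output and picking that run is shown below. The helper names `parse_runs` and `best_run` are hypothetical, and the parser assumes every table ends with its Triad row, as it does in this log.

```python
import re

# Matches a result row such as "Copy: 158176.44 .03 .03 .03".
ROW = re.compile(
    r"^\s*(Copy|Scale|Add|Triad):\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$"
)


def parse_runs(text):
    """Return a list of runs; each run maps kernel name -> rate in MB/s."""
    runs, current = [], {}
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue                      # skip banners, bind messages, sums
        current[match.group(1)] = float(match.group(2))
        if match.group(1) == "Triad":     # Triad is the last row of a table
            runs.append(current)
            current = {}
    return runs


def best_run(runs, kernel="Triad"):
    """Pick the run with the highest rate for the given kernel."""
    return max(runs, key=lambda run: run[kernel])


# Two of the tables from the output above, used as a small sample.
SAMPLE = """\
Function Rate (MB/s) RMS time Min time Max time
Copy: 158382.48 .03 .03 .03
Scale: 152918.74 .03 .03 .03
Add: 169487.13 .04 .04 .04
Triad: 173851.94 .04 .04 .04
 Sum of a is = 406877816418750.000
Function Rate (MB/s) RMS time Min time Max time
Copy: 158176.44 .03 .03 .03
Scale: 153812.38 .03 .03 .03
Add: 169687.38 .04 .04 .04
Triad: 174567.44 .04 .04 .04
"""

if __name__ == "__main__":
    runs = parse_runs(SAMPLE)
    print(len(runs), "runs parsed; best Triad run:", best_run(runs))
```

Selecting on Triad is a choice made here for illustration; passing a different `kernel` to `best_run` ranks the runs by that kernel instead.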



    This archive was generated by hypermail 2.1.4 : Wed Nov 03 2004 - 08:05:58 CST