Standard and Tuned STREAM on IBM eServer OpenPower 720

From: Duc Vianney (dvianney@us.ibm.com)
Date: Sun Sep 12 2004 - 15:42:49 CDT

    -------- STREAM Editor's Note 2004-09-13 -----------
    These results are listed in the STREAM tables as using 4 "processors", while the results below show 8 "threads".

    Here is the story:
    The system contains two POWER5 chips, each with two processor cores, for a total of four physical processor cores. Each POWER5 core can execute two threads simultaneously. When running in this "Simultaneous Multi-Threaded" (SMT) mode, the system appears to have twice as many CPUs. I call these entities "logical processors" to avoid confusion. The results below used a total of 8 OpenMP threads, one on each of the 8 logical processors provided by the 4 physical processor cores.

    (I hope that is clear. I cannot think of any other nomenclature that is less confusing. Since I am a member of the IBM POWER5 design team, it is sometimes hard for me to step back and understand how other people use these words.)

    It is inevitable that as computers get more complex, it will become more and more difficult to invent a labelling scheme that is not confusing or misleading.

    The good news is that in this particular case, the results depend only weakly on the number of threads used. The IBM eServer p5 550 contains nearly identical hardware and achieves slightly better STREAM results using large (16 MB) pages with 4 OpenMP threads. With the default small (4 kB) page size in Linux or AIX, more threads are needed to generate enough concurrent outstanding cache misses to reach these asymptotic bandwidth results.

    Comments and suggestions are welcome.
    --------- end of STREAM Editor's Note 2004-09-13 ------------
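
The counting described in the editor's note can be sketched in a few lines of Python. The counts come from the note itself; the variable names are only illustrative and are not part of STREAM or of any system tool.

```python
# Counts taken from the editor's note; the names are illustrative only.
chips = 2                  # POWER5 chips in the system
cores_per_chip = 2         # processor cores on each chip
smt_threads_per_core = 2   # hardware threads per core in SMT mode

physical_cores = chips * cores_per_chip
logical_processors = physical_cores * smt_threads_per_core

print(f"physical processor cores: {physical_cores}")
print(f"logical processors (SMT): {logical_processors}")
```

The first number is what the STREAM tables list as "processors"; the second matches the "Number of Threads" line and the eight entries in the CPU Info section of the submission.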

    Below are the Standard and Tuned STREAM results for the IBM eServer OpenPower 720
    (1650 MHz, 4CPU, Linux).

    IBM eServer OpenPower 720 (1650 MHz, 4CPU, Linux)
    Standard STREAM submission

     Number of Threads = 8
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 66060288
     Offset = 96
     The total memory requirement is 1512 MB
     You are running each test 100 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 176185 microseconds
        (= 176185 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function    Rate (MB/s)   RMS time   Min time   Max time
    Copy:        5874.6379      .1894      .1799      .6120
    Scale:       5783.2571      .1831      .1828      .1929
    Add:         7439.3970      .2135      .2131      .2137
    Triad:       7531.6010      .2108      .2105      .2110
     Sum of a is = 0.537150969562697781E+126
     Sum of b is = 0.107430193912641460E+126
     Sum of c is = 0.143240258546879354E+126
    locking to cpu 0
    locking to cpu 1
    locking to cpu 2
    locking to cpu 3
    locking to cpu 4
    locking to cpu 5
    locking to cpu 6
    locking to cpu 7
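
As a cross-check of the Standard output above, the following Python sketch recomputes the memory requirement and the reported rates from the array size and the minimum times. It assumes the usual STREAM byte counting (Copy and Scale touch two arrays per iteration, Add and Triad touch three), with memory reported in units of 2^20 bytes and rates in units of 10^6 bytes per second. The dictionary and function names are illustrative.

```python
ARRAY_SIZE = 66060288    # "Array size" from the output above
BYTES_PER_WORD = 8       # "Assuming 8 bytes per DOUBLE PRECISION word"

# Three arrays (a, b, c), reported in units of 2**20 bytes.
total_mb = 3 * ARRAY_SIZE * BYTES_PER_WORD / 2**20

# Assumed counting: arrays read plus arrays written per iteration.
ARRAYS_MOVED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# "Min time" column of the Standard submission, in seconds.
MIN_TIME = {"Copy": 0.1799, "Scale": 0.1828, "Add": 0.2131, "Triad": 0.2105}


def rate_mb_per_s(kernel):
    """Bytes moved by one pass of the kernel divided by its best time."""
    bytes_moved = ARRAYS_MOVED[kernel] * ARRAY_SIZE * BYTES_PER_WORD
    return bytes_moved / MIN_TIME[kernel] / 1e6


print(f"total memory: {total_mb:.0f} MB")
for kernel in MIN_TIME:
    print(f"{kernel:6s} {rate_mb_per_s(kernel):10.1f} MB/s")
```

Because the times are printed with only four digits, the recomputed rates should agree with the reported ones only to within a couple of MB/s, not exactly.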

    IBM eServer OpenPower 720 (1650 MHz, 4CPU, Linux)
    Tuned STREAM submission

     Number of Threads = 8
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 66060288
     Offset = 96
     The total memory requirement is 1512 MB
     You are running each test 100 times
     The *best* time for each test is used
     ----------------------------------------------------
     Your clock granularity/precision appears to be 1 microseconds
     The tests below will each take a time on the order
     of 176451 microseconds
        (= 176451 clock ticks)
     Increase the size of the arrays if this shows that
     you are not getting at least 20 clock ticks per test.
     ----------------------------------------------------
     WARNING -- The above is only a rough guideline.
     For best results, please be sure you know the
     precision of your system timer.
     ----------------------------------------------------
    Function    Rate (MB/s)   RMS time   Min time   Max time
    Copy:        6153.7267      .1831      .1718      .6521
    Scale:       6014.3981      .1759      .1757      .1772
    Add:         8610.8571      .1844      .1841      .1860
    Triad:       8801.5882      .1804      .1801      .1807
     Sum of a is = 0.537150969562697781E+126
     Sum of b is = 0.107430193912641460E+126
     Sum of c is = 0.143240258546879354E+126
    locking to cpu 0
    locking to cpu 1
    locking to cpu 2
    locking to cpu 3
    locking to cpu 4
    locking to cpu 5
    locking to cpu 6
    locking to cpu 7
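
To put the two submissions side by side, this short Python sketch computes the percentage gain of the Tuned rates over the Standard rates. The rates are copied from the two result tables; the names are illustrative.

```python
# Rates in MB/s copied from the Standard and Tuned result tables.
STANDARD = {"Copy": 5874.6379, "Scale": 5783.2571,
            "Add": 7439.3970, "Triad": 7531.6010}
TUNED = {"Copy": 6153.7267, "Scale": 6014.3981,
         "Add": 8610.8571, "Triad": 8801.5882}

# Percentage improvement of Tuned over Standard for each kernel.
gain_percent = {k: 100.0 * (TUNED[k] / STANDARD[k] - 1.0) for k in STANDARD}

for kernel, gain in gain_percent.items():
    print(f"{kernel:6s} {STANDARD[kernel]:10.1f} -> {TUNED[kernel]:10.1f} MB/s"
          f"  ({gain:+.1f}%)")
```

On these numbers the gain is a few percent for Copy and Scale and roughly 16 to 17 percent for Add and Triad.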

    Memory Info:
    MemTotal: 32163332 kB
    MemFree: 31981016 kB
    Buffers: 24416 kB
    Cached: 51656 kB
    SwapCached: 0 kB
    Active: 50408 kB
    Inactive: 37492 kB
    HighTotal: 0 kB
    HighFree: 0 kB
    LowTotal: 32163332 kB
    LowFree: 31981016 kB
    SwapTotal: 1048568 kB
    SwapFree: 1048568 kB
    Dirty: 144 kB
    Writeback: 0 kB
    Mapped: 15144 kB
    Slab: 32596 kB
    Committed_AS: 91820 kB
    PageTables: 632 kB
    VmallocTotal: 2147483647 kB
    VmallocUsed: 4040 kB
    VmallocChunk: 2147479015 kB
    HugePages_Total: 0
    HugePages_Free: 0
    Hugepagesize: 16384 kB
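
The Memory Info above shows a 16384 kB huge page size but HugePages_Total of 0, which suggests that no large pages were reserved on this system when the snapshot was taken. As a rough sketch, the following Python computes how many 16 MB pages would be needed to back the three STREAM arrays. It counts only the 1512 MB reported by STREAM and ignores any other memory the program uses; the variable names are illustrative.

```python
import math

HUGEPAGE_KB = 16384        # "Hugepagesize" from the Memory Info above
STREAM_MB = 1512           # "The total memory requirement is 1512 MB"
HUGEPAGES_RESERVED = 0     # "HugePages_Total" from the Memory Info above

# Round up: a partially used page still has to be reserved in full.
pages_needed = math.ceil(STREAM_MB * 1024 / HUGEPAGE_KB)

print(f"16 MB pages needed for the STREAM arrays: {pages_needed}")
print(f"16 MB pages reserved on this system:      {HUGEPAGES_RESERVED}")
```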

    CPU Info:
    processor : 0
    cpu : POWER5 (gr)
    clock : 1656.000000MHz
    revision : 2.1

    processor : 1
    cpu : POWER5 (gr)
    clock : 1656.000000MHz
    revision : 2.1

    processor : 2
    cpu : POWER5 (gr)
    clock : 1656.000000MHz
    revision : 2.1

    processor : 3
    cpu : POWER5 (gr)
    clock : 1656.000000MHz
    revision : 2.1

    processor : 4
    cpu : POWER5 (gr)
    clock : 1656.000000MHz
    revision : 2.1

    processor : 5
    cpu : POWER5 (gr)
    clock : 1656.000000MHz
    revision : 2.1

    processor : 6
    cpu : POWER5 (gr)
    clock : 1656.000000MHz
    revision : 2.1

    processor : 7
    cpu : POWER5 (gr)
    clock : 1656.000000MHz
    revision : 2.1

    timebase : 207000000
    machine : CHRP IBM,9124-720

    Operating System Info:
    SUSE LINUX Enterprise Server 9 for IBM POWER
    Linux version 2.6.5-7.97-pseries64 (geeko@buildhost) (gcc version 3.3.3
    (SuSE Linux)) #1 SMP Fri Jul 2 14:21:59 UTC 2004

    Compiler Info:
    XL Fortran Enterprise Edition Version 9.1 for Linux

    Regards .. Duc.

    Duc J Vianney, Ph. D., IBM Linux Technology Center Performance Team
    dvianney@us.ibm.com, Phone: (512) 838-9919 Fax: (512) 838-0070
    home page: http://www-124.ibm.com/developerworks/opensource/linuxperf/
    project page: http://www-124.ibm.com/developerworks/projects/linuxperf


