STREAM results for DDR Athlon and interleaved SDRAM P3 machines

From: tls@reefedge.com
Date: Thu Aug 30 2001 - 19:10:25 CDT


    I hope you want these results sent directly to you -- I couldn't find
    any indication on the STREAM web page of where to send them!

    Here are some STREAM results from three new machines we picked up where
    I work. I don't have any fancy compilers, so I'm sure these are somewhat
    lower than the numbers one could hit by using fancy FPU instructions to
    copy (e.g. SSE on the P-III machines or 3DNow! on the Athlons). Everything
    was compiled with gcc 2.91; flags were -O3 -fomit-frame-pointer
    -fstrict-aliasing -mcpu=pentiumpro -march=pentiumpro -static. The
    machines weren't entirely idle so I used second_cpu.c.
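
    For anyone wondering why the timer choice matters on a busy machine,
    here is a minimal sketch of the difference between a wall-clock timer
    and a CPU-time timer. It is an illustration in Python, not the code in
    second_cpu.c; the helper names (timed, wait_then_work) are made up for
    the example, and the sleep merely stands in for time spent descheduled.

```python
import time

def timed(fn, clock):
    """Run fn once and return the elapsed time according to `clock`."""
    start = clock()
    fn()
    return clock() - start

def wait_then_work():
    # Off-CPU time: stands in for the process being descheduled
    # while some other job runs on a machine that is not idle.
    time.sleep(0.05)
    # On-CPU time: the part a CPU-time timer should actually count.
    sum(i * i for i in range(100_000))

if __name__ == "__main__":
    wall = timed(wait_then_work, time.perf_counter)   # wall-clock time
    cpu = timed(wait_then_work, time.process_time)    # CPU time only
    print(f"wall {wall:.3f} s, cpu {cpu:.3f} s")
```

    The CPU-time figure leaves out the 50 ms wait, which is the property
    one wants when other jobs are competing for the processor.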

    Machine 1 is an Athlon 1333MHz (266MHz FSB) with an AMD 761 memory
    controller and 512MB of 133MHz DDR SDRAM ("PC2100" memory). ECC is turned
    *off* due to motherboard limitations. The memory is unbuffered. The
    motherboard is an Asus A7M266.

    -------------------------------------------------------------
    This system uses 8 bytes per DOUBLE PRECISION word.
    -------------------------------------------------------------
    Array size = 10000000, Offset = 0
    Total memory required = 228.9 MB.
    Each test is run 10 times, but only
    the *best* time for each is used.
    -------------------------------------------------------------
    Your clock granularity/precision appears to be 9999 microseconds.
    Each test below will take on the order of 249999 microseconds.
       (= 25 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    Function     Rate (MB/s)   RMS time   Min time   Max time
    Copy:           941.1765     0.1852     0.1700     0.2000
    Scale:          592.5926     0.2851     0.2700     0.3000
    Add:            727.2727     0.3462     0.3300     0.3600
    Triad:          685.7143     0.3500     0.3500     0.3500
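
    As a sanity check on the table, the Rate column can be reproduced from
    the Min time column. The sketch below assumes STREAM counts two arrays
    of traffic for Copy and Scale and three for Add and Triad, and that
    1 MB here means 10^6 bytes; the function and constant names are my own.

```python
# Reproduce STREAM's "Rate (MB/s)" column from the "Min time" column.
BYTES_PER_WORD = 8          # "8 bytes per DOUBLE PRECISION word"
ARRAY_SIZE = 10_000_000     # "Array size = 10000000"

# Assumed number of arrays read or written by each kernel.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

def stream_rate_mb_s(kernel, min_time_s, n=ARRAY_SIZE, word=BYTES_PER_WORD):
    """Bytes moved by one pass of `kernel`, divided by its best time."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * word * n
    return bytes_moved / min_time_s / 1e6   # MB taken as 10^6 bytes

# Best times reported for Machine 1, in seconds.
MACHINE_1_MIN_TIMES = {"Copy": 0.17, "Scale": 0.27, "Add": 0.33, "Triad": 0.35}

if __name__ == "__main__":
    for kernel, t in MACHINE_1_MIN_TIMES.items():
        print(f"{kernel:6s} {stream_rate_mb_s(kernel, t):10.4f}")
```

    With those assumptions the computed rates agree with the reported ones
    to the four decimal places shown.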

    Machine 2 is very similar to Machine 1, except that its DDR SDRAM is
    registered (Machine 1's memory is unbuffered) and ECC is turned on,
    since this board doesn't flip out when I enable it. The motherboard is
    an ABIT KG7; again, the memory controller is an AMD 761.

    Interestingly, the results are almost the same, and consistently *better*
    on the Triad kernel. I'd expect a performance drop from having ECC
    turned on, as well as from using registered memory, but I guess the
    principal impact of each is on latency, not throughput.
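
    One caveat on the Triad comparison: with a timer granularity of about
    10 ms, the best times are quantized, and the two Triad times (0.35 s on
    Machine 1, 0.33 s on Machine 2) are only two ticks apart. The sketch
    below shows how far the reported rate would move if a best time were
    off by one tick either way. It assumes Triad moves three arrays of
    10,000,000 8-byte words; the names are made up for the example.

```python
TRIAD_BYTES = 3 * 8 * 10_000_000   # three arrays of 10,000,000 8-byte words
TICK_S = 0.01                      # reported granularity (9999 us), rounded

def rate_mb_s(seconds, bytes_moved=TRIAD_BYTES):
    """Rate in MB/s (1 MB = 10^6 bytes) for one Triad pass."""
    return bytes_moved / seconds / 1e6

def rate_band(min_time_s, tick_s=TICK_S):
    """(low, reported, high) rate if the best time were off by one tick."""
    return (
        rate_mb_s(min_time_s + tick_s),
        rate_mb_s(min_time_s),
        rate_mb_s(min_time_s - tick_s),
    )

if __name__ == "__main__":
    for name, t in (("Machine 1", 0.35), ("Machine 2", 0.33)):
        low, reported, high = rate_band(t)
        print(f"{name}: {low:.1f} .. {reported:.1f} .. {high:.1f} MB/s")
```

    The one-tick bands around the two results meet at roughly 706 MB/s, so
    a larger array size would make the Triad comparison firmer.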

    -------------------------------------------------------------
    This system uses 8 bytes per DOUBLE PRECISION word.
    -------------------------------------------------------------
    Array size = 10000000, Offset = 0
    Total memory required = 228.9 MB.
    Each test is run 10 times, but only
    the *best* time for each is used.
    -------------------------------------------------------------
    Your clock granularity/precision appears to be 9999 microseconds.
    Each test below will take on the order of 259999 microseconds.
       (= 26 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    Function     Rate (MB/s)   RMS time   Min time   Max time
    Copy:           941.1765     0.2012     0.1700     0.2800
    Scale:          592.5926     0.2861     0.2700     0.3000
    Add:            727.2727     0.3492     0.3300     0.3700
    Triad:          727.2727     0.3511     0.3300     0.3600

    Machine 3 is a dual-processor 1GHz Pentium III on a SuperMicro 370DLE
    motherboard; the memory controller is integrated into the ServerWorks III
    chipset. The memory is 133MHz SDRAM. ServerWorks III does interleaving
    if possible (though only 2-way, I believe), and this machine has memory
    in all banks. I think we're seeing pin bandwidth limitations of the CPU
    here; interleaved SDR SDRAM *ought* to have about the same throughput as
    DDR SDRAM at the same clock, right?
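
    A back-of-the-envelope check of that question, as a sketch only. It
    assumes 64-bit (8-byte) data paths for both the memory and the
    front-side bus, two transfers per clock for DDR, ideal scaling from
    2-way interleaving, and a 133MHz single-data-rate FSB on the P-III.
    None of these are measured here, and peak_mb_s is a made-up helper.

```python
def peak_mb_s(clock_mhz, transfers_per_clock, bus_bytes=8, channels=1):
    """Theoretical peak in MB/s (10^6 bytes/s) for an idealized bus."""
    return clock_mhz * transfers_per_clock * bus_bytes * channels

# Memory side (assumed 8-byte-wide modules).
ddr_pc2100 = peak_mb_s(133, 2)                  # DDR: two transfers per clock
sdr_2way = peak_mb_s(133, 1, channels=2)        # SDR, ideal 2-way interleave

# CPU side (assumed 8-byte-wide front-side buses).
athlon_fsb = peak_mb_s(133, 2)                  # "266MHz FSB" = 133MHz x 2
p3_fsb = peak_mb_s(133, 1)                      # assumed 133MHz, one per clock

if __name__ == "__main__":
    print(f"DDR PC2100 memory : {ddr_pc2100} MB/s")
    print(f"2-way SDR memory  : {sdr_2way} MB/s")
    print(f"Athlon FSB        : {athlon_fsb} MB/s")
    print(f"P-III FSB         : {p3_fsb} MB/s")
```

    Under those assumptions the two memory systems have the same peak, but
    the P-III's bus tops out at half of it, which is at least consistent
    with the pin bandwidth guess.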

    I ran only one copy of STREAM at a time on this machine. I hope that was
    the correct thing to do.

    -------------------------------------------------------------
    This system uses 8 bytes per DOUBLE PRECISION word.
    -------------------------------------------------------------
    Array size = 10000000, Offset = 0
    Total memory required = 228.9 MB.
    Each test is run 10 times, but only
    the *best* time for each is used.
    -------------------------------------------------------------
    Your clock granularity/precision appears to be 9999 microseconds.
    Each test below will take on the order of 330000 microseconds.
       (= 33 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    Function     Rate (MB/s)   RMS time   Min time   Max time
    Copy:           390.2439     0.4100     0.4100     0.4100
    Scale:          400.0000     0.4070     0.4000     0.4100
    Add:            510.6383     0.4770     0.4700     0.4800
    Triad:          358.2090     0.6750     0.6700     0.6800


