[stream] The importance of using OFFSET in STREAM

From: David T. Wang (davewang@wam.umd.edu)
Date: Fri May 07 2004 - 14:10:12 CDT

    The Problem:

    I had previously submitted some STREAM bandwidth results on
    a Dell PowerEdge 400SC. Recently, as I was going through some notes, I
    realized that I had not "tuned" the software to perform optimally on the
    hardware. Specifically, with OFFSET in STREAM set to 0, the array address
    boundaries were aligned such that the read and write streams of A[], B[],
    and C[] were mapped to the same DRAM bank. It turns out that by picking
    the correct offset, STREAM triad and add scores can be increased by more
    than 20%.

    *********************************************************************

    References:

    Lin, Reinhardt, Burger, "Reducing DRAM latencies with an Integrated Memory
    Hierarchy Design", HPCA 7, Jan 2001.

    Zhang, Zhu, Zhang, "Breaking Address Mapping Symmetry at Multi-levels of
    Memory Hierarchy to Reduce DRAM Row-buffer Conflicts", Journal of
    Instruction Level Parallelism, Vol 3, 2002.

    Intel i875P chipset datasheet.

    *********************************************************************

    Optimization Effort:

    For the 875P chipset, the bank ID is placed on physical address bits 14
    and 15 for 256 Mbit DRAM chips. So all arrays larger than 2^16 bytes need
    to be offset by 2^14 bytes to try to get the arrays mapped to different
    banks. (This assumes that the virtual-to-physical address mapping ends up
    contiguous, which, as it turns out, wasn't a bad assumption.)

    So, for STREAM, the offset needs to be 2^14 bytes. Since sizeof(double)
    is 8 bytes, OFFSET needs to be 2^14 / 8 = 2^11 = 2048 in this system
    configuration.

    **********************************************************************

    System Configuration:

    Dell PowerEdge 400SC 2.8GHz Pentium 4 in HT mode. Mandrake Linux
    800 Mbps FSB, "dual channel" PC3200 DDR SDRAM peak BW = 6400 MB/s

    2 Kingston KVR400X72C3A/512 DIMMs ECC 3-3-3 timing.

    **********************************************************************

    The Result:

    wowbagger:~/stream: ./stream_nooffset
    -------------------------------------------------------------
    This system uses 8 bytes per DOUBLE PRECISION word.
    -------------------------------------------------------------
    Array size = 8388608, Offset = 0
    Total memory required = 192.0 MB.
    Each test is run 10 times, but only
    the *best* time for each is used.
    -------------------------------------------------------------
    Your clock granularity appears to be less than one microsecond.
    Each test below will take on the order of 44954 microseconds.
       (= -2147483648 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    Function      Rate (MB/s)   RMS time     Min time     Max time
    Copy:           2331.2204     0.0583       0.0576       0.0593
    Scale:          2473.4660     0.0547       0.0543       0.0552
    Add:            2584.2938     0.0789       0.0779       0.0813
    Triad:          2496.4570     0.0816       0.0806       0.0826
    wowbagger:~/stream: vi stream_d.c
    wowbagger:~/stream: cc -O stream_d.c second_wall.c -o stream_offset -lm
    wowbagger:~/stream: ./stream_offset
    -------------------------------------------------------------
    This system uses 8 bytes per DOUBLE PRECISION word.
    -------------------------------------------------------------
    Array size = 8388608, Offset = 2048
    Total memory required = 192.0 MB.
    Each test is run 10 times, but only
    the *best* time for each is used.
    -------------------------------------------------------------
    Your clock granularity appears to be less than one microsecond.
    Each test below will take on the order of 44153 microseconds.
       (= -2147483648 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    Function      Rate (MB/s)   RMS time     Min time     Max time
    Copy:           2448.4653     0.0553       0.0548       0.0563
    Scale:          2474.4657     0.0549       0.0542       0.0559
    Add:            3164.7217     0.0642       0.0636       0.0657
    Triad:          3157.9160     0.0641       0.0638       0.0655

    ***************************************************************
    Summary:

    A simple OFFSET in this case placed the arrays in different DRAM banks,
    and STREAM_add and STREAM_triad bandwidth increased by roughly 580 MB/s
    and 660 MB/s, respectively. Copy gained about 5% and Scale was
    essentially unchanged.


