Re: STREAM results for Apple G4/400 system ii (w/AltiVec)

From: Anton Rang (rang@trillium.adaptec.com)
Date: Sun Apr 02 2000 - 11:46:56 CDT


    Hi again,

    Here are the STREAM results for a hand-tuned version running on an Apple G4
    machine. As before, it's a 400 MHz system; it may also be worth noting
    that my test machine has 3-2-2 SDRAM installed.

    The major change from my previous results is that I recoded the inner
    loop in assembly and added use of 'dcbz'. The prefetch strategy is similar
    to the previous one, but the exact parameters may be slightly better tuned.
    (Empirically these settings seem to work a little better, though
    theoretically there shouldn't really be any difference. I wish I knew
    exactly how the data streaming was implemented.)
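
    For readers unfamiliar with the technique, here is a minimal C sketch of
    a copy loop that uses 'dcbz' on the destination. It is not the attached
    assembly. The names copy_dcbz, LINE_BYTES, ZERO_LINE, PREFETCH and the
    USE_DCBZ switch are made up for illustration, the 32-byte line size is an
    assumption about the G4, and the prefetch distance is a guess. A plain
    compiler prefetch hint stands in for the data-streaming prefetch mentioned
    above. Without USE_DCBZ on a PowerPC GCC-style compiler, the sketch falls
    back to an ordinary copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 32   /* assumed cache-line (dcbz block) size on the G4 */
#define DOUBLES_PER_LINE (LINE_BYTES / sizeof(double))
#define PREFETCH_LINES 8  /* guessed prefetch distance, in cache lines */

/* Opt-in only: dcbz zeroes a whole cache block, so it is safe only when the
   real block size equals LINE_BYTES and the block is then fully overwritten. */
#if defined(USE_DCBZ) && defined(__GNUC__) && \
    (defined(__ppc__) || defined(__powerpc__))
#define ZERO_LINE(p) __asm__ volatile("dcbz 0,%0" : : "r"(p) : "memory")
#else
#define ZERO_LINE(p) ((void)(p))      /* portable no-op fallback */
#endif

#if defined(__GNUC__)
#define PREFETCH(p) __builtin_prefetch((p), 0, 0)
#else
#define PREFETCH(p) ((void)(p))
#endif

/* Hypothetical STREAM-style copy, c[i] = a[i]. The arrays must not overlap,
   because zeroing a destination line would destroy unread source data. */
void copy_dcbz(double *restrict c, const double *restrict a, size_t n)
{
    size_t i = 0;
    const size_t ahead = PREFETCH_LINES * DOUBLES_PER_LINE;

    assert(n == 0 || (a != NULL && c != NULL));

    /* Take the line-at-a-time path only if c starts on a line boundary. */
    if (((uintptr_t)c % LINE_BYTES) == 0) {
        for (; i + DOUBLES_PER_LINE <= n; i += DOUBLES_PER_LINE) {
            if (i + ahead < n)
                PREFETCH(a + i + ahead);  /* hint: source line needed soon */
            ZERO_LINE(c + i);  /* claim the line without reading it from RAM */
            for (size_t k = 0; k < DOUBLES_PER_LINE; k++)
                c[i + k] = a[i + k];
        }
    }
    for (; i < n; i++)         /* remainder, or everything if c is unaligned */
        c[i] = a[i];
}
```

    The idea is that a store to a line that is not in the cache normally
    forces the line to be read from memory first. Establishing the line with
    'dcbz' avoids that read, which is why the tests that write a whole array
    would be expected to benefit.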

    Compared with my previous results, the improvement is reasonably good for
    what is, after all, a memory-bandwidth-limited test. Presumably most, if
    not all, of it is due to using dcbz.

      Copy: 19%
      Scale: 17%
      Add: 9%
      Triad: 12%

    The results fluctuate more than I'd like from run to run, even when bumping
    up the array size or number of runs. I suspect some hidden state (TLBs,
    perhaps) is affecting the results.

    I don't know why the 'add' test runs more slowly than 'triad'. This is
    quite counterintuitive as the inner loop is exactly the same except that
    triad is using multiply-adds, which are somewhat slower than additions.
    The simulator I'm using claims each should run in the same number of clock
    cycles (the extra latency of the multiplication is hidden by memory
    stalls), but real life is different.
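
    For reference, the two kernels being compared look like this in plain C.
    This is a sketch of the standard STREAM definitions, not the attached
    assembly; the function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Add: c[i] = a[i] + b[i]. Two loads, one store, one add per element. */
void stream_add(double *c, const double *a, const double *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL && c != NULL));
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Triad: a[i] = b[i] + s * c[i]. Same memory traffic as Add, but the
   arithmetic is a multiply-add instead of a plain add. */
void stream_triad(double *a, const double *b, const double *c,
                  double s, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL && c != NULL));
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}
```

    Both loops read two arrays and write one, so on a bandwidth-limited
    machine they would be expected to run at about the same rate.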

    I've attached my hard-coded loops. (I modified the main source to just
    call these procedures.)

    Thanks again!

    -- Anton

    -------------------------------------------------------------
    This system uses 8 bytes per DOUBLE PRECISION word.
    -------------------------------------------------------------
    Array size = 400000, Offset = 0
    Total memory required = 9.2 MB.
    Each test is run 20 times, but only
    the *best* time for each is used.
    -------------------------------------------------------------
    Your clock granularity/precision appears to be 10 microseconds.
    Each test below will take on the order of 21117 microseconds.
       (= 2111 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    Function     Rate (MB/s)   RMS time   Min time   Max time
    Copy:           569.6991     0.0115     0.0112     0.0118
    Scale:          545.0055     0.0119     0.0117     0.0123
    Add:            484.1150     0.0200     0.0198     0.0202
    Triad:          497.1517     0.0195     0.0193     0.0198
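
    The rates in the table can be roughly reproduced from the minimum times.
    The sketch below assumes the usual STREAM accounting: Copy and Scale touch
    two arrays per iteration, Add and Triad touch three, and 1 MB is 10^6
    bytes. The helper name stream_rate_mb_s is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Bandwidth in MB/s (1 MB = 1e6 bytes) for a kernel that touches
   `arrays_touched` arrays of `n` words, `word_bytes` bytes each,
   in `seconds` of wall-clock time. */
double stream_rate_mb_s(int arrays_touched, size_t n, size_t word_bytes,
                        double seconds)
{
    assert(arrays_touched > 0 && seconds > 0.0);
    return 1.0e-6 * (double)arrays_touched * (double)word_bytes * (double)n
           / seconds;
}
```

    With n = 400000 and 8-byte words, Copy moves 6.4e6 bytes per pass, so
    stream_rate_mb_s(2, 400000, 8, 0.0112) gives about 571 MB/s. The small
    gap from the reported 569.6991 comes from the minimum time being printed
    to only four decimal places.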




    --
    This is an unauthorized communication.  "The statements and opinions
    expressed herein are my own and do not necessarily represent those of
    Adaptec."
    


