stream: P4 with streaming stores (experimental)

From: H.W. Stockman (
Date: Fri Feb 16 2001 - 17:15:59 CST


    "Experimental" DP Stream for P4 1.4 GHz, using
    streaming stores, average of 10 runs:

    copy_ 2102.07 MB/s
    scale 2101.54 MB/s
    add__ 2111.31 MB/s
    triad 2103.53 MB/s

    All under Windows 98 SE; basic system services and
    Intel system monitor running in background.
    System: 1.4 GHz Pentium4 with 1 GB PC800 RDRAM,
    in Intel i850 board (homemade system).
    All 64 RDRAM devices are in use, so this is not an
    ideal system with respect to latency.

    Hard disk is 5400 rpm 30 GB Maxtor, ATA/66

    Executable: Compiled with Intel C/C++ 5.0, under MS
    VC++ 6.0:
    /ML /W2 /GX /D "WIN32" /D "NDEBUG" /D
    "_CONSOLE" /FA /Fa"Release/"
    /Fp".\Release\stream.pch" /YX /Fo".\Release/"
    /Fd".\Release/" /FD /G7 /O3 -Qrestrict -QxW
    Only /G7 /O3 -QxW are relevant; the code
    has no restrict qualifiers.

    Ran 10 times in sequence (from batch file), and
    averaged all 10 for results reported above.

    (1) Modified source. I added calls to a simple validation
    function, similar to the one provided with the FORTRAN
    source, and I used a second() function provided by a
    previous poster (my apologies, I've forgotten his
    name). The automatic vectorizer would handle every
    part of the translation except the streaming stores.
    Current or future versions of VTune may add
    streaming stores automatically, which would remove
    the need to call the results "experimental". I used
    the Intel intrinsic functions to avoid fiddling with
    assembly language; for this version, no attempt was
    made to use prefetch instructions.
    (2) Executable, for P4.
    (3) GIF file showing variation with run number.
    (4) Batch file to run 10 times, and output of 10 runs
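
For readers without the attachment, here is a minimal sketch of how a STREAM-style triad loop might look when the store is written with the SSE2 intrinsic _mm_stream_pd. It is not the modified source described in (1). The names triad_stream, triad_check, USE_SSE2 and N are hypothetical, the arrays are assumed to be 16-byte aligned with an even length, and, as in the post, no prefetch instructions are used. A plain scalar loop is used as a fallback when SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>          /* SSE2 intrinsics: __m128d, _mm_stream_pd, ... */
#define USE_SSE2 1
#endif

#define N 1024                  /* hypothetical length, kept small for the check */
static _Alignas(16) double a[N], b[N], c[N];

/* STREAM-style triad: dst[j] = x[j] + q * y[j].
   Assumes 16-byte-aligned arrays and an even element count. */
void triad_stream(double *dst, const double *x, const double *y,
                  double q, size_t n)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)x % 16 == 0 &&
           (uintptr_t)y % 16 == 0 && n % 2 == 0);
#ifdef USE_SSE2
    const __m128d vq = _mm_set1_pd(q);
    for (size_t j = 0; j < n; j += 2) {
        __m128d vx = _mm_load_pd(&x[j]);
        __m128d vy = _mm_load_pd(&y[j]);
        /* Store with a non-temporal hint, intended to bypass the cache. */
        _mm_stream_pd(&dst[j], _mm_add_pd(vx, _mm_mul_pd(vq, vy)));
    }
    _mm_sfence();               /* order the streamed stores before later stores */
#else
    for (size_t j = 0; j < n; j++)   /* portable fallback with ordinary stores */
        dst[j] = x[j] + q * y[j];
#endif
}

/* Simple validation in the spirit of the post: returns 0 if all elements match. */
int triad_check(void)
{
    for (size_t j = 0; j < N; j++) { a[j] = 0.0; b[j] = 1.0; c[j] = 2.0; }
    triad_stream(a, b, c, 3.0, N);
    for (size_t j = 0; j < N; j++)
        if (a[j] != 7.0) return 1;   /* 1.0 + 3.0 * 2.0 is exactly 7.0 */
    return 0;
}
```

Calling triad_check() from main and returning its value gives a quick pass/fail run. A real bandwidth measurement would use arrays far larger than the cache and time the loop, which this sketch does not attempt.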

