[stream] STREAM benchmark

From: Petr Pospíšil (pospisil@vc.cvut.cz)
Date: Tue May 11 2004 - 02:30:33 CDT

    Dear Dr. McCalpin,

    I am sending results from our 16-processor SGI_Altix_3700 (1.3 GHz Itanium 2, 3 MB L3 cache), compiled with ecc -O3 -o streamomp -openmp stream_d_omp.c second_wall.c. The results are of the same order as those at the bottom of your table (SGI_Altix_350):

    -------------------------------------------------------------
    This system uses 8 bytes per DOUBLE PRECISION word.
    -------------------------------------------------------------
    Array size = 50000000, Offset = 0
    Total memory required = 1144.4 MB.
    Each test is run 1 times, but only
    the *best* time for each is used.
    -------------------------------------------------------------
    Your clock granularity/precision appears to be 1 microseconds.
    Each test below will take on the order of 21175 microseconds.
    (= 21175 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    Function     Rate (MB/s)   RMS time   Min time   Max time
    Copy:         26364.3472     0.0303     0.0303     0.0303
    Scale:        27058.0619     0.0296     0.0296     0.0296
    Add:          31517.7547     0.0381     0.0381     0.0381
    Triad:        31818.4191     0.0377     0.0377     0.0377

    I am a beginner in OpenMP and am also trying other programs. When I could not match the performance of analogous programs, I found that performance is more than 10 times worse when the input data are not initialized using omp pragmas. I commented out the pragmas in the initialization of the vectors (the program is enclosed) and obtained:

    -------------------------------------------------------------
    This system uses 8 bytes per DOUBLE PRECISION word.
    -------------------------------------------------------------
    Array size = 50000000, Offset = 0
    Total memory required = 1144.4 MB.
    Each test is run 50 times, but only
    the *best* time for each is used.
    -------------------------------------------------------------
    Your clock granularity/precision appears to be 1 microseconds.
    Each test below will take on the order of 243007 microseconds.
    (= 243007 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    Function     Rate (MB/s)   RMS time   Min time   Max time
    Copy:          1787.8257     0.4772     0.4475     0.4789
    Scale:         1668.5131     0.4805     0.4795     0.4818
    Add:           1714.3323     0.7011     0.7000     0.7036
    Triad:         1722.7615     0.6979     0.6966     0.6994

    Could you be so kind as to tell me what could be the cause? Thank you very much.

    STREAM EDITOR's NOTE: This performance difference is caused by the way that the SGI Altix (and Origin2000 and Origin3000) distribute memory in the system. The memory is physically distributed around the machine -- in the case of this 16-cpu system, there are 8 "nodes", each of which has a separate memory controller and each of which manages 1/8 of the total physical memory (in the usual homogeneously populated configuration).
    When a program runs and allocates memory, the default mode of operation on these systems is called the "first touch" algorithm. When this mode is active, memory is allocated in the physical memory closest to the processor running the thread that first references a particular memory location. If the OpenMP initialization directives are removed, then the data is all initialized by the master thread and all of the data is placed on the node where the master thread is located. Then when the main loops of the STREAM benchmark run, all of the processors in the system are trying to read and write to the memory of that one node. This variant of the test stresses both the memory controller of the node where the data is located and the NUMAlink interconnect between that node and the rest of the system.
    Under "normal" circumstances, the STREAM benchmark makes negligible use of the NUMAlinks between nodes -- the overwhelming majority of memory accesses are intended to be "local" to each node performing the loads and stores. The OpenMP initialization of the STREAM benchmark is intended to distribute the data across the nodes in the same way that it will later be used in order to maximize performance.
    Measuring the memory bandwidth of a single node is interesting, and measuring the performance of the NUMAlink is interesting (and you are of course welcome to use STREAM to explore these issues), but I publish these numbers on the STREAM web site in order to show the best bandwidth that a system can sustain using ordinary code.
    Hope this helps.
    Dr. Bandwidth


    Dr. Petr Pospisil

    Computing and Information Centre, Czech Technical University, Prague




