memory bandwidth

From: Alex Vasilevsky (alex@Think.COM)
Date: Tue Nov 05 1991 - 13:43:10 CST


   Date: Tue, 5 Nov 91 13:51:44 EST
   From: "John D. McCalpin" <mccalpin@perelandra.cms.udel.edu>

   I was looking over the memory bandwidth results that you provided
   for the CM-2, and had a question I hope you could answer.
   The Copy and Scale operations ran at about 56 GB/s on the 64-K machine.
   This implies a memory bandwidth limitation of
           4 Bytes/clock/fpu * 2048 fpus * 7 MHz = 57.3 GB/s
   The Sum and Triad operations ran at about 80 GB/s, which implies
   that there is more data bandwidth available. Do the fpus have more
   than one independent 32-bit data path?
   Thanks for any explanation!
   --
   John D. McCalpin mccalpin@perelandra.cms.udel.edu
   Assistant Professor mccalpin@brahms.udel.edu
   College of Marine Studies, U. Del. DELOCN::MCCALPIN (SPAN)

No. It is just that the Fortran compiler is very clever and generates the
best possible code sequence for these cases. Looking at the timings, even
though the machine does an extra load for Sum and Triad, they take only
about 5% longer than Copy and Scale. I see a very similar thing going on
with the Cray.
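
The 5% figure can be checked against the minimum times in the run output
that follows. The sketch below is only an illustration: the times are copied
from that output, while the 16 and 24 bytes counted per iteration are an
assumption taken from the usual STREAM convention (8-byte words, 2 words for
Copy and Scale, 3 for Sum and Triad). The function name is made up for this
example.

```python
# Min times (seconds) from the CM-2 run; the bytes-per-iteration counts are
# ASSUMED from the usual STREAM convention with 8-byte words.
MIN_TIME = {"Assignment": 0.0231, "Scaling": 0.0231,
            "Summing": 0.0242, "SAXPYing": 0.0244}
BYTES_PER_ITER = {"Assignment": 16, "Scaling": 16,
                  "Summing": 24, "SAXPYing": 24}

def relative_to_copy(kernel):
    """Return (time ratio, counted-rate ratio) of a kernel vs. Assignment."""
    t_ratio = MIN_TIME[kernel] / MIN_TIME["Assignment"]
    b_ratio = BYTES_PER_ITER[kernel] / BYTES_PER_ITER["Assignment"]
    # Rate is counted bytes divided by time, so the rate ratio is the
    # byte ratio divided by the time ratio.
    return t_ratio, b_ratio / t_ratio

for k in ("Summing", "SAXPYing"):
    t_ratio, rate_ratio = relative_to_copy(k)
    print(f"{k}: {100 * (t_ratio - 1):.1f}% more time, "
          f"{rate_ratio:.2f}x the counted rate")
```

Under that counting assumption, about 5% more time for 50% more counted
bytes gives a reported rate roughly 1.4 times the Copy rate, which is in
line with the rates in the run output.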

CM2 64K
--------
gorka(test)% stream
-------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLEPRECISION word
-------------------------------------
Calibrating CM timer...Done. CM speed = 7.00 MHz
Timing calibration ; time = 143.324911594391 hundredths of a second
Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second
---------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment     56555.9654      .0231      .0231      .0231
Scaling        56555.9654      .0231      .0231      .0231
Summing        81244.0132      .0242      .0242      .0242
SAXPYing       80593.2540      .0244      .0244      .0244
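
To put these rates next to the estimate in the quoted question, here is a
small sketch. The rates are copied from the table above and the 4
bytes/clock/fpu, 2048 fpus and 7 MHz come from the question; treating 1 MB
as 10^6 bytes is an assumption, and the function names are made up for this
example.

```python
# Rates (MB/s) copied from the table above; peak inputs are the figures
# used in the quoted question. 1 MB = 1e6 bytes is an ASSUMPTION.
RATE_MB_S = {"Assignment": 56555.9654, "Scaling": 56555.9654,
             "Summing": 81244.0132, "SAXPYing": 80593.2540}

def peak_mb_per_s(bytes_per_clock_per_fpu=4, n_fpus=2048, clock_mhz=7.0):
    # bytes/clock * clocks/microsecond = bytes/microsecond = MB/s
    return bytes_per_clock_per_fpu * n_fpus * clock_mhz

def fraction_of_peak(kernel):
    """Reported rate of a kernel divided by the question's estimate."""
    return RATE_MB_S[kernel] / peak_mb_per_s()

for kernel in RATE_MB_S:
    print(f"{kernel:<10} {100 * fraction_of_peak(kernel):6.1f}% of estimate")
```

Assignment and Scaling come out just under 99% of the 57.3 GB/s estimate,
while Summing and SAXPYing come out at roughly 140% of it.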

Here are the two DO loops and the vector code that our compiler generates:

First DO loop:

         DO 40 j = 1,n
            c(j) = a(j) + b(j)
   40    CONTINUE

Code generated by the compiler; every instruction here is a vector instruction.

procedure _stream_pe_code_3
L1$_stream_pe_code_3:
                                                                    
        popq aC2
        popa SP
        # Get address of A
        popa aP2
        # Get address of B
        popa aP3
        # Get address of C
        popa aP4

L2$_stream_pe_code_3:

        dflodv [aP3+0]2++ aV0
        # "stream_d.fcm" line 102
        # C = A + B
        dfaddv [aP2+0]2++ aV0 aV1
        dfstrv aV1 [aP4+0]2++
        jnz aC2 L2$_stream_pe_code_3
end

Second DO loop:

         DO 50 j = 1,n
            c(j) = a(j) + 3.0D0*b(j)
   50    CONTINUE

Code generated by the compiler; every instruction here is a vector instruction.

procedure _stream_pe_code_4
L1$_stream_pe_code_4:

        popq aC2
        popa SP
        # Get address of A
        popa aP2
        # Get address of B
        popa aP3
        # Get address of C
        popa aP4
        # "stream_d.fcm" line 110
        # C = A + 3.0D0*B
        dflodc $3.000000000000000000d+00 aS28
                                                       

L2$_stream_pe_code_4:

        dflodv [aP2+0]2++ aV0
        # "stream_d.fcm" line 110
        # C = A + 3.0D0*B
        dfmuladdv aS28 [aP3+0]2++ aV1 aV0 aV1
        dfstrv aV1 [aP4+0]2++
        jnz aC2 L2$_stream_pe_code_4

end
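
Reading the two listings, each loop body appears to be three vector
instructions: one explicit load, one arithmetic instruction that takes the
other input as a memory operand, and one store. The Python below is only a
rough model of that shape, inferred from the mnemonics and the
post-incremented operands, not from CM-2 documentation. VLEN and all
function names are hypothetical, and the model ignores how the arrays are
spread across processors.

```python
VLEN = 4  # HYPOTHETICAL vector length; the real value is not given here

def run_strips(a, b, c, combine):
    """Walk the arrays one vector-length strip at a time, as the
    post-incremented operands ([aPn+0]2++) and the jnz loop suggest."""
    for i in range(0, len(c), VLEN):
        # One strip from an explicit load and one strip from a memory
        # operand feed the arithmetic; the result strip is stored to c.
        c[i:i + VLEN] = [combine(x, y)
                         for x, y in zip(a[i:i + VLEN], b[i:i + VLEN])]
    return c

def sum_loop(a, b, c):
    # First DO loop: c(j) = a(j) + b(j)
    return run_strips(a, b, c, lambda x, y: x + y)

def triad_loop(a, b, c):
    # Second DO loop: c(j) = a(j) + 3.0D0*b(j)
    return run_strips(a, b, c, lambda x, y: x + 3.0 * y)

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [10.0, 20.0, 30.0, 40.0, 50.0]
print(sum_loop(a, b, [0.0] * 5))    # [11.0, 22.0, 33.0, 44.0, 55.0]
print(triad_loop(a, b, [0.0] * 5))  # [31.0, 62.0, 93.0, 124.0, 155.0]
```

In this model both loops touch the same three memory streams per strip (two
reads and one write), and the multiply in the second loop rides along in the
same arithmetic step as the add.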


