Each memory port can move two words per cycle, corresponding to the two
pipes. I don't know whether the hardware has two pipes in each of the four
ports or four ports in each pipe, but the effect is the same. So for the
triad, the C-90 theoretical peak is
(2 read + 1 write) * 2 pipes * 8 bytes * 16 processors / 4.0e-9 s = 192 Gbytes/s.
Similarly for the Y-MP8:
(2 read + 1 write) * 8 bytes * 8 processors / 6.0e-9 s = 32 Gbytes/s.
Andrew Zachary's result of 26.802 Gbytes/s for the Y-MP8 is consistent
with that; the shortfall is due to vector startup, bank conflicts, etc.
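
The peak figures above are easy to re-derive. What follows is only a sketch of that arithmetic, not part of the benchmark; the function and variable names are mine, and the Y-MP is taken to have one pipe because no pipe factor appears in its formula.

```python
def peak_bandwidth(reads, writes, pipes, word_bytes, cpus, cycle_s):
    """Theoretical peak memory bandwidth in bytes/s over all processors."""
    words_per_cycle = (reads + writes) * pipes   # words moved per CPU per clock
    return words_per_cycle * word_bytes * cpus / cycle_s

# Triad: 2 reads + 1 write per iteration, 8-byte words.
c90 = peak_bandwidth(2, 1, pipes=2, word_bytes=8, cpus=16, cycle_s=4.0e-9)
ymp8 = peak_bandwidth(2, 1, pipes=1, word_bytes=8, cpus=8, cycle_s=6.0e-9)

measured_ymp8 = 26.802e9   # bytes/s, the Y-MP8 result quoted above

print(f"C-90 peak : {c90 / 1e9:.1f} Gbytes/s")
print(f"Y-MP8 peak: {ymp8 / 1e9:.1f} Gbytes/s")
print(f"Y-MP8 measured/peak: {measured_ymp8 / ymp8:.1%}")
```

On these numbers the measured Y-MP8 rate is roughly 84% of the theoretical peak.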
I ran the autotasking tests on our Y-MP2, but could not get consistent
results: the tests measure elapsed rather than CPU time, and the machine was busy.
I think the next test to do is to check the results with the arrays
aligned differently, i.e. a(i) = b(i+k) for various k; my results show a
factor of two difference between different values of k. Might I suggest
making the arrays of length 2**20 and putting them in common, so that the
start of each array is consistently aligned in the same bank?
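
To show why a length of 2**20 in common gives a consistent alignment, here is a small model of the bank arithmetic. It is a sketch under assumptions: memory is taken to be word-interleaved across a power-of-two number of banks (bank = word address mod number of banks), and NBANKS = 256 is an illustrative value, not the actual bank count of any of these machines. All names are hypothetical.

```python
N = 2**20        # array length in words, as suggested above
NBANKS = 256     # ASSUMPTION: illustrative power-of-two bank count

def bank(word_address, nbanks=NBANKS):
    """Bank holding a word, assuming simple word-interleaved memory."""
    return word_address % nbanks

# Arrays laid out back to back, as they would be in a common block.
A_BASE = 0
B_BASE = A_BASE + N

def bank_offset(k, i=0, nbanks=NBANKS):
    """Bank distance between a(i) and b(i+k) in this model."""
    return (bank(B_BASE + i + k, nbanks) - bank(A_BASE + i, nbanks)) % nbanks

# Because N is a multiple of NBANKS, both arrays start in the same bank,
# so the bank distance depends only on k, not on i or on the layout.
for k in (0, 1, 2, 4, 8):
    print(f"k = {k}: a(i) and b(i+k) are {bank_offset(k)} banks apart")
```

In this model the offset k is the only thing that changes the relative bank alignment, so a timing difference between values of k can be attributed to alignment rather than to where the arrays happened to be placed.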