For your information, here is how CRAY T3D memory loads work:
All times are for 6.6 nanosecond clock periods. The times are
for transfers of full cache lines (32 bytes or 4 64-bit words).
Registers <--(2 clocks)-- Data Cache <--(14 clocks)-- Read-ahead Buffer <--(7 or 23 clocks)-- DRAM page
If data is in the same DRAM page (essentially on the same chip),
then the DRAM time is 7 clocks; else 23 clocks. If only one
input stream is needed, then it is safe (and easier) to use a
"read-ahead mode". In read-ahead mode, the next cache line is
automatically loaded into the Read-ahead Buffer. The subsequent
load only needs one clock period to complete the transfer. Note
that read-ahead mode is not what we would want for scalar or
random memory accesses.
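As a rough illustration of the load timings above, here is a small Python sketch that adds up the stage times and converts clocks to nanoseconds. It assumes the three stages are serialized so that their latencies simply add, which the text does not state, and the function names are made up for this example.

```python
CLOCK_NS = 6.6          # clock period in nanoseconds (from the text)
LINE_BYTES = 32         # one cache line = 4 64-bit words (from the text)

REG_FROM_CACHE = 2      # clocks, Data Cache -> Registers
CACHE_FROM_BUFFER = 14  # clocks, Read-ahead Buffer -> Data Cache
DRAM_SAME_PAGE = 7      # clocks, data in the same DRAM page
DRAM_NEW_PAGE = 23      # clocks, data in a different DRAM page


def dram_clocks(same_page):
    """DRAM time for one cache line, depending on whether the page changes."""
    return DRAM_SAME_PAGE if same_page else DRAM_NEW_PAGE


def line_load_clocks(same_page):
    """Clocks to bring one cache line from DRAM to the registers.

    ASSUMPTION: the stages run one after another, so their times add.
    """
    return REG_FROM_CACHE + CACHE_FROM_BUFFER + dram_clocks(same_page)


def clocks_to_ns(clocks):
    """Convert a clock count to nanoseconds at 6.6 ns per clock."""
    return clocks * CLOCK_NS
```

Under the additive assumption, a same-page line costs 23 clocks end to end, and a line from a different DRAM page costs 39 clocks.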
Here is how CRAY T3D memory stores work:
The Alpha microprocessor (EV4) has 4 write buffers, each 1 cache line
long. As a program issues stores, the data is moved immediately
to a write buffer, which is later written to DRAM memory. If a
write buffer is full (4 words from the same cache line), the EV4
initiates a cache-line store.
If three write buffers are partially full, and a store
instruction is issued whose cache-line address is not
contained in any of the partially full write buffers, then
the microprocessor selects the fourth write buffer for its
store. However, in order to get ready for any subsequent stores,
it also writes one of the partially full write buffers to
memory. This hurts overall write bandwidth, since the write
buffer is only partially full of data.
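The write-buffer behavior described above can be sketched as a toy Python model that counts full and partial flushes for a sequence of stores. This is only an approximation: the text does not say which partially full buffer gets written out, so the model assumes the oldest one, and the names simulate_stores and interleaved are invented for this example.

```python
from collections import OrderedDict

WORDS_PER_LINE = 4   # 4 64-bit words per cache line (from the text)
NUM_BUFFERS = 4      # four write buffers (from the text)


def simulate_stores(word_addresses):
    """Return (full_flushes, partial_flushes) for a sequence of word addresses.

    Buffers still holding data at the end are not counted.
    """
    buffers = OrderedDict()  # cache-line number -> set of word offsets held
    full = partial = 0
    for addr in word_addresses:
        line, offset = divmod(addr, WORDS_PER_LINE)
        if line not in buffers:
            # Three partially full buffers plus a store to a new line:
            # the new line takes the fourth buffer and one partial
            # buffer is written to memory.
            if len(buffers) == NUM_BUFFERS - 1:
                buffers.popitem(last=False)  # ASSUMPTION: oldest is evicted
                partial += 1
            buffers[line] = set()
        buffers[line].add(offset)
        if len(buffers[line]) == WORDS_PER_LINE:
            # All 4 words of the line are present: cache-line store.
            del buffers[line]
            full += 1
    return full, partial


def interleaved(bases, n):
    """Addresses for a loop that stores element i of every stream per pass."""
    return [base + i for i in range(n) for base in bases]
```

In this model, simulate_stores(range(16)) gives 4 full flushes and no partial ones, while simulate_stores(interleaved((0, 1000, 2000, 3000), 4)) gives no full flushes and 13 partial ones.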
Page effects come into play if the various write buffers go to
different DRAM pages. Also, it costs more to switch between pages.
The four write buffers work great for one continuous stream. DO loops
with multiple stores are more difficult, partly because of the DRAM
paging and partly because it is difficult to manage the write buffers
so that only full buffers are flushed to memory.
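To make the paging effect a little more concrete, here is a simplified Python sketch that counts how often consecutive stores land in different DRAM pages. It treats every store as if it reached DRAM in program order, which ignores the write buffers, and the page size is a hypothetical value since the text does not give one. All names are invented for this example.

```python
PAGE_WORDS = 256  # HYPOTHETICAL DRAM page size in 64-bit words


def page_switches(word_addresses, page_words=PAGE_WORDS):
    """Count how often consecutive stores fall in different DRAM pages."""
    switches = 0
    previous_page = None
    for addr in word_addresses:
        page = addr // page_words
        if previous_page is not None and page != previous_page:
            switches += 1
        previous_page = page
    return switches


def one_stream(n, base=0):
    """Store addresses for a loop writing one contiguous array."""
    return [base + i for i in range(n)]


def two_streams(n, base_a=0, base_b=4096):
    """Store addresses for a loop writing two arrays on every pass."""
    trace = []
    for i in range(n):
        trace.append(base_a + i)
        trace.append(base_b + i)
    return trace
```

With these assumed values, one stream of 512 words changes page once, while two interleaved streams of only 8 words each change page 15 times.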
Cray Research, Inc.