Re: STREAM for T3D

From: Charles Grassl (cmg@ferrari.cray.com)
Date: Tue Jan 31 1995 - 10:07:02 CST

John;

For your information, here is how CRAY T3D memory loads work:

All times are in clock periods of 6.6 nanoseconds. The times are
for transfers of full cache lines (32 bytes, or four 64-bit words).

        2 clocks         14 clocks              7-23 clocks
Registers <---- Data Cache <---- Read-ahead Buffer <---- DRAM page

      If data is in the same DRAM page (essentially on the same chip),
      then the DRAM time is 7 clocks; else 23 clocks. If only one
      input stream is needed, then it is safe (and easier) to use a
      "read-ahead mode". In read-ahead mode, the next cache line is
      automatically loaded into the Read-ahead Buffer. The subsequent
      load only needs one clock period to complete the transfer. Note
      that read-ahead mode is not what we would want for scalar or
      random memory accesses.
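
The figures above can be turned into a rough per-line cost. The sketch
below is only a back-of-the-envelope reading of them: it assumes the
three stage times simply add with no overlap, which the description
does not state, and it leaves read-ahead mode out. The function names
are made up for this illustration.

```python
CLOCK_NS = 6.6         # clock period in nanoseconds (from the text)
LINE_BYTES = 32        # one cache line = four 64-bit words (from the text)
CACHE_TO_REG = 2       # clocks, Data Cache -> Registers
BUFFER_TO_CACHE = 14   # clocks, Read-ahead Buffer -> Data Cache
DRAM_SAME_PAGE = 7     # clocks, DRAM -> Read-ahead Buffer, same DRAM page
DRAM_NEW_PAGE = 23     # clocks, DRAM -> Read-ahead Buffer, different page


def line_load_clocks(same_page):
    """Clocks to bring one cache line from DRAM to the registers.

    ASSUMPTION: the stage latencies add serially, with no overlap.
    """
    dram = DRAM_SAME_PAGE if same_page else DRAM_NEW_PAGE
    return CACHE_TO_REG + BUFFER_TO_CACHE + dram


def line_bandwidth_mb_s(clocks):
    """Bandwidth in MB/s (10**6 bytes/s) if each line costs `clocks`."""
    seconds = clocks * CLOCK_NS * 1e-9
    return LINE_BYTES / seconds / 1e6


for same_page in (True, False):
    clocks = line_load_clocks(same_page)
    label = "same DRAM page" if same_page else "new DRAM page"
    print(f"{label}: {clocks} clocks, "
          f"{line_bandwidth_mb_s(clocks):.1f} MB/s")
```

Under that additive assumption a line costs 23 clocks within a DRAM
page and 39 clocks across pages, so the page effect alone changes the
estimate by a large factor.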

Here is how CRAY T3D memory stores work:

      The Alpha microprocessor has 4 write buffers, each 1 cache line
      long. As a program issues stores, the data is moved immediately
      to a write buffer which is later written to DRAM memory. If a
      write buffer is full (4 words from the same cache line) the EV4
      initiates a cache-line store.

      If three write buffers are partially full and a store is issued
      to a cache-line address that is not held in any of them, the
      microprocessor selects the fourth write buffer for its store.
      However, to get ready for any subsequent stores, it also writes
      one of the partially full write buffers to memory. This hurts
      overall write bandwidth, since that write buffer is only
      partially full of data.

      Page effects come into play if the various write buffers go to
      different DRAM pages. Also, it costs more to switch between pages
      for stores.

The four write buffers work great for one continuous stream. DO loops
with multiple stores are more difficult, partly because of the DRAM
paging and partly because it is difficult to manage the write buffers
so that only full buffers are flushed to memory.
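
The write-buffer behaviour described above can be played through with
a toy model. This is a sketch, not a description of the actual EV4
logic: it assumes the buffer written out early is always the oldest
one (the text only says "one of the partially full write buffers"),
it ignores DRAM page effects, and it counts buffers still pending at
the end as partial writes. All names are invented for the example.

```python
from collections import OrderedDict

WORDS_PER_LINE = 4   # 32-byte cache line = four 64-bit words
NUM_BUFFERS = 4      # write buffers, each one cache line long


def simulate_stores(word_addresses):
    """Return (full_line_writes, partial_line_writes) for a store stream."""
    buffers = OrderedDict()   # line address -> set of word offsets, oldest first
    full = partial = 0
    for addr in word_addresses:
        line, offset = divmod(addr, WORDS_PER_LINE)
        if line not in buffers:
            buffers[line] = set()
            if len(buffers) == NUM_BUFFERS:
                # The new line took the fourth buffer, so one partially
                # full buffer is written out to free a slot.
                # ASSUMPTION: the oldest buffer is the one chosen.
                buffers.popitem(last=False)
                partial += 1
        buffers[line].add(offset)
        if len(buffers[line]) == WORDS_PER_LINE:
            # All four words of the line are present: full cache-line store.
            del buffers[line]
            full += 1
    # Whatever is still buffered at the end goes out partially full.
    partial += len(buffers)
    return full, partial


def round_robin(bases, n_words):
    """Interleave stores to several arrays, as a DO loop with several stores would."""
    return [base + i for i in range(n_words) for base in bases]


print("1 stream :", simulate_stores(range(32)))
print("3 streams:", simulate_stores(round_robin([0, 1000, 2000], 8)))
print("4 streams:", simulate_stores(round_robin([0, 1000, 2000, 3000], 8)))
```

In this model one stream, or up to three interleaved streams, produces
only full cache-line writes, while four interleaved streams push every
store out as a partial write. The exact threshold depends on the
assumed eviction choice.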

Regards,
Charles Grassl
Cray Research, Inc.

This archive was generated by hypermail 2b29 : Tue Apr 18 2000 - 05:23:04 CDT