>>In article <firstname.lastname@example.org> email@example.com (Gary Klaassen) writes:
>>>Several others have pointed out that TFP Power Challenge will
>>>incorporate a streaming cache capable of supplying data to the
>>>floating point units at a rate of 21.6 GBytes per second. This is
>>>presumably the rate between the cache and FPU, but what about the rate
>>>between memory and cache?
>There are two levels of cache in the current machine (the "Challenge"),
>but I am told that the "Power Challenge" will bypass the first-level
>cache for FP operands.
This jibes with what John Mashey told me. The Power Challenge Icache will
be 32KB, the Dcache 16KB, and the external secondary cache will be
2-16MB with a direct path to the FPU.
>I do not know the exact specs for the primary cache interface on the
>Challenge machines, except that the cache line is 16 bytes. Assuming
There are 3 caches: 16KB instruction, 16KB data and 1MB external.
The field engineer told me they have 10ns RAM.
>a second-level cache hit has a latency of 3 cycles (chosen out of a
>nearby hat) and a 128-bit wide interface, then the transfer time is
>4 cycles, and the bandwidth is 300 MB/s. Since the transfer time
>is much smaller than the latency, this throughput estimate is roughly
>a linear function of the latency. Any better numbers from SGI?
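As a sanity check on the estimate quoted above, here is a minimal sketch of the arithmetic. The function name and parameters are my own, and I am assuming the 75 MHz external clock mentioned later in the post. I also read the quoted "4 cycles" as 3 cycles of latency plus 1 cycle to move a 16-byte line over a 128-bit interface; that reading is an assumption.

```python
def hit_bandwidth_mb_s(line_bytes, bus_bytes_per_cycle, latency_cycles, clock_mhz):
    """Estimated MB/s when every access pays the full latency plus transfer time."""
    # Cycles needed to move one line across the interface (rounded up).
    transfer_cycles = -(-line_bytes // bus_bytes_per_cycle)
    total_cycles = latency_cycles + transfer_cycles
    # bytes * MHz / cycles gives MB/s (1 MB taken as 10**6 bytes).
    return line_bytes * clock_mhz / total_cycles

# 16-byte line, 128-bit (16-byte) interface, 3-cycle latency, 75 MHz (assumed)
print(hit_bandwidth_mb_s(16, 16, 3, 75))   # 300.0
```

Because the latency term dominates the denominator, the estimate scales roughly as 1/latency: with a 7-cycle latency the same call gives 150 MB/s.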
>The traffic between secondary cache and main memory goes over the main
>system bus with a peak throughput of 1.2 GB/s. Unfortunately, a single cpu
>can only get about 1/20 of this performance via the cache interface. The
>problem is that the latency is about 55-60 cycles (at 75 MHz). The
>secondary cache lines are 128 bytes, and the bus width is 256 bits. So the
>cache miss time is 60 cycles latency plus 4 cycles for the actual data
>transfer. This gives a peak throughput of about 90 MB/s (note that the
>bus clock is about 48 MHz, while the cpu external clock is 75 MHz).
Yeah, tell me about it. We bought one of these Challenge L's only
to discover its memory throughput is more or less the same as an Indigo's.
This is of course because the memory subsystem is basically the same
as an Indigo's (with some differences for SMP). For some reason they neglected
to mention this. Moral: don't trust marketing numbers.
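The miss-time arithmetic quoted above can be checked with a short sketch. The function name is hypothetical, and the choice of clock is my assumption: the quoted figure of about 90 MB/s only comes out if the 64 cycles are counted at the 48 MHz bus clock rather than the 75 MHz cpu external clock.

```python
def miss_bandwidth_mb_s(line_bytes, bus_bytes_per_cycle, latency_cycles, clock_mhz):
    """Estimated MB/s when every secondary-cache line fill pays latency + transfer."""
    # Cycles to move one cache line over the bus (rounded up).
    transfer_cycles = -(-line_bytes // bus_bytes_per_cycle)
    total_cycles = latency_cycles + transfer_cycles
    # bytes * MHz / cycles gives MB/s (1 MB taken as 10**6 bytes).
    return line_bytes * clock_mhz / total_cycles

# 128-byte line, 256-bit (32-byte) bus, 60-cycle latency
at_bus_clock = miss_bandwidth_mb_s(128, 32, 60, 48)   # 96.0
at_cpu_clock = miss_bandwidth_mb_s(128, 32, 60, 75)   # 150.0
print(at_bus_clock, at_cpu_clock)

# Fraction of the quoted 1.2 GB/s peak that one cpu would see
print(at_bus_clock / 1200.0)   # 0.08
```

Either way a single cpu sees well under a tenth of the quoted bus peak, and the observed 60 MB/s is the "about 1/20" mentioned in the quoted text.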
>Observed throughput from my "STREAM" benchmark shows about 60 MB/s on
>a single-cpu system. Results are available by anonymous ftp to
>perelandra.cms.udel.edu in bench/stream/.
I am curious why the IBM 580 numbers are so much higher. How much
of this is due to the memory bandwidth and how much is due to the
superscalar nature of the RS6000 cpu? The benchmarks you chose are well
suited to RS6000 superscalar and would disadvantage some of the other
chips in the table.
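For anyone who has not seen the kind of loop a memory-throughput benchmark times, here is a rough sketch. It is not the benchmark's actual source. I am assuming a triad-style kernel, a[i] = b[i] + q*c[i], and the function name, array size and scalar are made up for illustration. In Python the interpreter overhead swamps the memory system, so the number it prints says nothing about hardware; the point is only how the bytes are counted.

```python
import time
from array import array

def triad_mb_s(n=200_000, q=3.0):
    """Time a[i] = b[i] + q*c[i] and return (a, nominal MB/s)."""
    b = array('d', [1.0]) * n
    c = array('d', [2.0]) * n
    a = array('d', [0.0]) * n
    start = time.perf_counter()
    for i in range(n):
        a[i] = b[i] + q * c[i]
    elapsed = time.perf_counter() - start
    # Two loads and one store of an 8-byte double per iteration.
    bytes_moved = 3 * a.itemsize * n
    return a, bytes_moved / elapsed / 1e6

result, rate = triad_mb_s()
print(f"nominal triad rate: {rate:.1f} MB/s")
```

A loop like this has one multiply and one add per 24 bytes moved, which is why a cpu that can issue loads and a fused multiply-add together might do well on it independent of raw memory bandwidth.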
>Since this throughput is dominated by latency, presumably faster cache
>controller hardware in future revisions will run faster. The TFP chip
>will be able to cut the latency immediately by bypassing the first-
>level cache for FP operands.
I certainly hope so.
Prof. G. P. Klaassen
Dept. of Earth and Atmospheric Science
York University, North York, Ontario, Canada M3J 1P3