Re: Hardware Prefetching and other CRI T3E goodies (was Re: MIPS R10k/R12k references/URL?)

From: C. Evangelinos
Date: Fri Mar 19 1999 - 15:13:05 CST

> I don't know the details. I know that there is a barrier tree in
> the router network and that the design came from the T3E folks
> in Chippewa Falls.

I don't have time to recheck, but this sounds more like the faster T3D
barrier tree.

> >PS> Have you had any IBM formal submission of STREAM numbers for Model
> > 260? I wouldn't want to submit mine.
> No word from IBM. I have gotten numbers from a number of customers,
> and they are all in the 550-800 MB/s range. I may put them in the
> standard table soon, just to try to get IBM's attention....

Well, in that case I'm not revealing any embarrassing secret:

These are the best results I managed to get when compiling with flags
for POWER3 and using a high-resolution PowerPC timer:

xlf and also
flags                         CPUs   Copy  Scale    Add  Triad  (MB/s)
-O3 -qarch=pwr3 -qtune=pwr3      1  548.9  545.3  766.0  767.6

Using the standard timer may actually give slightly higher numbers, but
each measurement then spans only a very small number of timer ticks.
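To make the timer-tick concern concrete, here is a minimal sketch in C of
how one might estimate a timer's granularity and see how many ticks a
measured interval covers. It assumes a POSIX system with clock_gettime();
the function names now_seconds, min_tick and ticks_spanned are made up for
this illustration and are not taken from the STREAM source or from the
timers used for the numbers above.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Wall-clock time in seconds from the POSIX monotonic clock. */
double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

/* Smallest nonzero step observed between successive clock readings,
 * taken over `samples` trials. This is an estimate of the usable
 * granularity, not a guaranteed hardware resolution. */
double min_tick(int samples)
{
    assert(samples > 0);
    double best = 1.0e30;
    for (int i = 0; i < samples; i++) {
        double t0 = now_seconds();
        double t1;
        do {
            t1 = now_seconds();   /* spin until the clock advances */
        } while (t1 <= t0);
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* How many ticks of size `tick` a measured interval covers. */
double ticks_spanned(double elapsed, double tick)
{
    assert(tick > 0.0);
    return elapsed / tick;
}
```

If ticks_spanned() comes out as only a handful of ticks for one kernel
pass, the relative timing error is large, and a bandwidth figure derived
from that timing should be treated with caution.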

Following the suggestions in the RS/6000 Scientific and Technical
Computing: POWER3 Introduction and Tuning Guide (which describes how to
tune BLAS/ESSL dcopy()), I transformed the STREAM source code for COPY
and managed to get approximately 862.3 MB/s of user-visible COPY
bandwidth(*). This is still below IBM's claims (and anyway the standard
STREAM results are supposed to be attainable without hand-tuning *and*
the technique used is so simple that the compiler should have it in as
a special case, given that copying vectors is very frequently done in
(*) I had to compile with -O2 instead of -O3 as -O3 negated the
effects of the hand-tuning.
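
For readers unfamiliar with how a COPY figure like the one above is
derived, here is a rough sketch in C of the untuned kernel and the usual
STREAM-style accounting. It is not the hand-tuned version (the tuning
guide's transformation is not reproduced in this post), and the function
names stream_copy and copy_mbytes_per_sec are hypothetical. The
accounting assumes the STREAM convention of counting one 8-byte read and
one 8-byte write per element, with 1 MB = 10^6 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Plain STREAM-style COPY kernel: c[i] = a[i]. */
void stream_copy(double *c, const double *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i];
}

/* User-visible COPY bandwidth in MB/s (10^6 bytes per second).
 * Only the bytes the program asks for are counted: one read and one
 * write of a double per element. Extra memory traffic generated by
 * the hardware (for example, fetching a cache line before it is
 * overwritten) is not included. */
double copy_mbytes_per_sec(size_t n, double seconds)
{
    assert(seconds > 0.0);
    double bytes = 2.0 * (double)sizeof(double) * (double)n;
    return 1.0e-6 * bytes / seconds;
}
```

As a sanity check on the arithmetic, copying 1,000,000 doubles in 0.016 s
moves 16,000,000 user-visible bytes, which works out to 1000 MB/s.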

Constantinos Evangelinos
Center for Fluid Mechanics
Brown University/Division of Applied Mathematics
