From turner@csrd.uiuc.edu  Sat Jan 23 16:17:25 1993
Received: from s46.csrd.uiuc.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA28124; Sat, 23 Jan 93 16:17:25 -0500
Received: from a7.csrd.uiuc.edu by s46.csrd.uiuc.edu with SMTP id AA10384
  (5.67a/IDA-1.5 for <mccalpin@perelandra.cms.udel.edu>); Sat, 23 Jan 1993 15:18:36 -0600
Received: by a7.csrd.uiuc.edu (4.12/9.2)
	id AA00934; Sat, 23 Jan 93 15:13:35 cst
Date: Sat, 23 Jan 93 15:13:35 cst
From: turner@csrd.uiuc.edu (Steve Turner)
Message-Id: <9301232113.AA00934@a7.csrd.uiuc.edu>
To: mccalpin
Subject: Stream results
Status: RO

Your posting on comp.sys.super piqued my curiosity, so I snarfed the
benchmark and ported it to our machines.  I work for CSRD, and we have
a bunch of Alliant machines here, as well as our own home-brewed
agglomeration of 4 FX/80s called Cedar.  You probably have heard of
us, so I'll just tell you what I did to get the results and then give
results.

I made two changes to the source code.  First, I replaced the calls to
"second" with calls to the High Resolution Clock timer facility
(hrcget and hrcdelta).  This is a microsecond resolution timer used for
performance evaluation, so I think the results should be accurate.
Second, I made slight changes to the result FORMAT statements, since
Alliant's Fortran compiler assumes carriage control info is used.
I will send you a copy of the altered source, if you want, but since
the changes were so trivial it doesn't seem necessary.

The results for an FX/80 with 8 processors (~11.75 MHz clock rate)
compiled with Alliant's fortran compiler using only the "-O" option:

--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =    630.571000000000       hundredths  of a second
 Increase the size of the arrays if this is <30   and your clock precision is =<1/100 second
 ---------------------------------------------------
 Function     Rate (MB/s)  RMS time   Min time  Max time
 Assignment:   72.8155      0.0692      0.0659      0.0739
 Scaling   :   71.5990      0.0700      0.0670      0.0746
 Summing   :   76.2793      0.0971      0.0944      0.1028
 SAXPYing  :   76.5143      0.0998      0.0941      0.1119

----------------
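
The Rate column in the table above appears to be the bytes moved per pass
divided by the minimum time.  That relationship is inferred from the
figures, not stated in the message, so the sketch below is only a rough
cross-check under that assumption (and assuming 1 MB = 1e6 bytes).  The
function name is made up for illustration.

```python
def implied_bytes(rate_mb_s, min_time_s):
    """Bytes moved per pass, assuming rate = bytes / min time and 1 MB = 1e6 bytes."""
    return rate_mb_s * 1e6 * min_time_s


# (rate in MB/s, min time in s) copied from the FX/80 table above
fx80 = {
    "Assignment": (72.8155, 0.0659),
    "Scaling": (71.5990, 0.0670),
    "Summing": (76.2793, 0.0944),
    "SAXPYing": (76.5143, 0.0941),
}

for name, (rate, tmin) in fx80.items():
    print(f"{name:10s} {implied_bytes(rate, tmin) / 1e6:.2f} MB per pass")
```

Assignment and Scaling come out near 4.8 MB per pass and Summing and
SAXPYing near 7.2 MB, a 2:3 ratio that would be consistent with the first
two kernels touching two arrays and the last two touching three.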


The results for an FX/2800 using a 14 processor "cluster", compiled
with Alliant's fortran compiler using just the "-O" option are:
(sorry, I don't know the clock rate of the i860's) 
--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =    15.7310000000000      hundredths of a second
 Increase the size of the arrays if this is <30  and your clock precision is =<1/100 second
 ---------------------------------------------------
 Function     Rate (MB/s)  RMS time   Min time  Max time
 Assignment:  144.6655      0.0342      0.0332      0.0362
 Scaling   :  150.1877      0.0328      0.0320      0.0347
 Summing   :  135.1859      0.0549      0.0533      0.0584
 SAXPYing  :  125.3264      0.0590      0.0575      0.0618
----------------

Since this seemed to run too fast, I bumped up the array size by one
order of magnitude and ran it again.  The "long stream" results are:
--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =    185.139000000000      hundredths of a second
 Increase the size of the arrays if this is <30  and your clock precision is =<1/100 second
 ---------------------------------------------------
 Function     Rate (MB/s)  RMS time   Min time  Max time
 Assignment:  309.0394      0.1650      0.1553      0.1882
 Scaling   :  305.9273      0.1658      0.1569      0.1834
 Summing   :  297.8160      0.2528      0.2418      0.2889
 SAXPYing  :  291.9708      0.2620      0.2466      0.3141
----------------
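
The decision to rerun with larger arrays follows the hint printed by the
benchmark itself (calibration time under 30 hundredths of a second).  A
minimal sketch of that check, applied to the three calibration times
reported above; it ignores the clock-precision half of the printed hint,
and the function name and default threshold argument are illustrative
only:

```python
def arrays_too_small(calibration_hundredths, threshold=30.0):
    """True when the calibration time is under the threshold printed by the benchmark."""
    return calibration_hundredths < threshold


# Calibration times (hundredths of a second) copied from the runs above
runs = {
    "FX/80": 630.571,
    "FX/2800": 15.731,
    "FX/2800 long stream": 185.139,
}

for name, cal in runs.items():
    verdict = "increase array size" if arrays_too_small(cal) else "ok"
    print(f"{name}: {cal:.3f} hundredths -> {verdict}")
```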

I plan on porting it to Cedar, too, but this will require modification
of the array declarations in order to distribute the arrays to the
global memory.  I'll send details along with the results once I get them.

st

From lfm@pgroup.com  Sat Jan 23 22:08:10 1993
Received: from libby.pgroup.com by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA28286; Sat, 23 Jan 93 22:08:10 -0500
Received: by libby.pgroup.com id AA14392
  (5.65c/IDA-1.4.4 for mccalpin@perelandra.cms.udel.edu); Sat, 23 Jan 1993 19:09:16 -0800
Date: Sat, 23 Jan 1993 19:09:16 -0800
From: Larry Meadows <lfm@pgroup.com>
Message-Id: <199301240309.AA14392@libby.pgroup.com>
To: mccalpin
Subject: stream results
Status: RO


This is for a 40 mhz i860-XR workstation.  I know of an i860-XP based PC
card that gets 400 MB/Sec.  So that makes it faster than any other
workstation on your list, and most of the superminis.  Now if it just had
a superscalar FP unit...

Maybe someone else will run it on the paragon.  Intel wouldn't like it
if I did.

I always knew that the HP systems only got their performance when
things fit in cache.

--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =    93.00000000000000      hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:  160.0000      0.0333      0.0300      0.0400
Scaling   :  160.0000      0.0363      0.0300      0.0400
Summing   :  120.0000      0.0632      0.0600      0.0700
SAXPYing  :  120.0000      0.0652      0.0600      0.0700

From lfm@pgroup.com  Mon Jan 25 13:55:24 1993
Received: from libby.pgroup.com by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA01609; Mon, 25 Jan 93 13:55:24 -0500
Received: by libby.pgroup.com id AA19798
  (5.65c/IDA-1.4.4 for mccalpin@perelandra.cms.udel.edu); Mon, 25 Jan 1993 10:56:39 -0800
Date: Mon, 25 Jan 1993 10:56:39 -0800
From: Larry Meadows <lfm@pgroup.com>
Message-Id: <199301251856.AA19798@libby.pgroup.com>
To: mccalpin
Subject: Re:  stream results
Status: RO

>>This is for a 40 mhz i860-XR workstation.   
>
>Who makes this?  What is its full model name?

Stardent Vistra, manufactured by Oki Electric, equivalent to an Oki
7300 Model 20.  40 MHZ, 8KB data cache, 4KB instruction cache, 32MB of
memory, running Unix System V Release 4, using The Portland Group's
pgf77 version 2.1.  Now sold by Kubota Computer.

>>I know of an i860-XP based PC
>>card that gets 400 MB/Sec.  So that makes it faster than any other
>>workstation on your list, and most of the superminis.  Now if it just had
>>a superscalar FP unit... 
>
>Is that *really* 400 MB/s from compiled high-level code?
>Is it one of those expensive versions with all SRAM instead of DRAM?

400 MB/Sec from compiled code. Note, however, that the code generation
technique we use for compilation ends up calling an assembly coded routine
to pull the data into cache, so it is running flat out.  Nevertheless,
the technique is generally applicable; it is equivalent to using cache as
vector registers, and the assembly routines (called streamin/streamout
routines) are equivalent to hardware vload/vstore instructions.
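
The "cache as vector registers" idea can be sketched as a strip-mined
loop: pull a block of each operand into a small buffer, compute on the
buffer, then write the block back.  The code below shows only that control
flow.  It is not the pgf77-generated code or the actual streamin/streamout
routines, Python cannot control cache residency, and the block length and
all function names are hypothetical.

```python
BLOCK = 512  # hypothetical block length in elements, small enough to fit a small cache


def stream_in(src, start, stop):
    """Stand-in for a 'streamin' routine: copy one block into a local buffer."""
    return src[start:stop]


def stream_out(dst, start, buf):
    """Stand-in for a 'streamout' routine: write a finished block back to memory."""
    dst[start:start + len(buf)] = buf


def blocked_sum(a, b):
    """Compute c = a + b one block at a time, treating each buffer like a vector register."""
    n = len(a)
    c = [0.0] * n
    for start in range(0, n, BLOCK):
        stop = min(start + BLOCK, n)
        va = stream_in(a, start, stop)          # plays the role of a vector load
        vb = stream_in(b, start, stop)          # plays the role of a vector load
        vc = [x + y for x, y in zip(va, vb)]    # arithmetic on the buffered block
        stream_out(c, start, vc)                # plays the role of a vector store
    return c
```

The last block may be shorter than BLOCK, which is why the loop clamps
`stop` to the array length.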

They use static column DRAM (like the oki station above).  Don't know
the price.  The specific board I mention is made by Transtech
(try Richard Stevens -- rs@transt.co.uk if you want further information).

lfm

From mnp@Texaco.COM  Mon Apr 19 18:35:36 1993
Received: from Texaco.TEXACO.COM by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA01682; Mon, 19 Apr 93 18:35:36 -0400
Received: by Texaco.COM (4.1/SMI-4.1)
	id AA11387; Mon, 19 Apr 93 17:41:10 CDT
Date: Mon, 19 Apr 93 17:41:10 CDT
From: mnp@Texaco.COM (Mark N. Portney)
Message-Id: <9304192241.AA11387@Texaco.COM>
To: mccalpin
Status: RO

John,

   Thanks for the clear description of c = a + b.

   I do not understand how to predict the SGI performance.  (It doesn't
matter other than my curiosity!)  On the Crimson,
cache miss penalties (both) were quoted as 110 internal cycles with no write
back, and 119 with write back.  These seem consistent with some
measurements.  Apparently these also seem to apply to the Challenge (too
bad).  The latency is so large, it wouldn't matter if the bandwidth were
infinite! :-)  The current Challenge is 100 MHz internal, 50 MHz external, 
47.6 MHz bus.  Cache lines are 16 bytes primary, 128 bytes secondary, and the 
bus (256 bits wide) can deliver data on 4 out of 5 cycles on one transaction,
for (256/8 bytes)*(47.6MHz)*(4/5) = 1.218 GB/s.  (The secondary cache is 1MB).
Next clock goes to 150 MHz, 75MHz, bus still at 47.6 MHz.  I have asked SGI
for an explanation of how the different parts of the system affect the latency,
and how it should change in the future.  If I hear from them, I'll send you 
the information.
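
The bus arithmetic above, and the point that latency rather than bandwidth
is the limit, can be worked through numerically.  The peak figure uses only
numbers from the message.  The latency-bound figure rests on two
assumptions that are not in the message: that each miss brings in one
128-byte secondary cache line, and that misses do not overlap.  Function
names are illustrative.

```python
def bus_peak_bytes_per_s(width_bits, bus_hz, duty):
    """Peak bus bandwidth: bytes per transfer times transfers per second times duty cycle."""
    return (width_bits / 8) * bus_hz * duty


def latency_bound_bytes_per_s(line_bytes, miss_cycles, clock_hz):
    """Bandwidth if every cache line costs one full, non-overlapped miss penalty."""
    return line_bytes / (miss_cycles / clock_hz)


# 256-bit bus at 47.6 MHz, data on 4 out of 5 cycles (from the message)
peak = bus_peak_bytes_per_s(256, 47.6e6, 4 / 5)

# ASSUMPTION: one 128-byte line per 110-cycle miss at 100 MHz internal clock
bound = latency_bound_bytes_per_s(128, 110, 100e6)

print(f"bus peak      : {peak / 1e6:.1f} MB/s")
print(f"latency bound : {bound / 1e6:.1f} MB/s")
```

Under those assumptions the latency bound is roughly a tenth of the bus
peak, which is the sense in which more bus bandwidth would not help.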

   I believe the latency for the IBM 580 is about 15.5 cycles from other
tests I have run, and a TLB miss penalty of 38 cycles.

   I have run streams on several machines here (included below).  The 
Challenge runs better when I compile on the Crimson (? - I'll track this down).
If you would like me to run some small benchmarks for you on the 580 (or 
Challenge), let me know.

                                                Thanks again,
                                                      Mark


stream_d
SGI Crimson R4000 os 4.0.5  ftn 3.10
f77 -O2 -mips2
--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =    91.99999682605267     hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   61.5385      0.2700      0.2600      0.2800
Scaling   :   59.2594      0.2750      0.2700      0.2800
Summing   :   58.5367      0.4140      0.4100      0.4200
SAXPYing  :   59.9999      0.4080      0.4000      0.4100
 
stream_d
SGI Challenge R4400 os 5.0    ftn 3.10
f77 -O2 -mips2
compiled on Challenge
--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =    122.9999981820583     hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   48.4847      0.3410      0.3300      0.3500
Scaling   :   47.0589      0.3451      0.3400      0.3600
Summing   :   50.0000      0.4921      0.4800      0.5000
SAXPYing  :   54.5456      0.4451      0.4400      0.4600
 
stream_d
SGI Challenge R4400 os 5.0  ftn 3.10
f77 -O2 -mips2
compiled on Crimson
--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =    117.0000027865171     hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   57.1429      0.2891      0.2800      0.3000
Scaling   :   55.1724      0.3001      0.2900      0.3100
Summing   :   53.3334      0.4641      0.4500      0.4800
SAXPYing  :   54.5456      0.4450      0.4400      0.4500
 
stream_s
SGI Crimson R4000 os 4.0.5  ftn 3.10
f77 -O2 -mips2
--------------------------------------
 Single precision appears to have  7 digits of accuracy
 Assuming 4 bytes per default REAL word
--------------------------------------
 Timing calibration ; time =    47.00000     hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   57.1429      0.1471      0.1400      0.1500
Scaling   :   53.3333      0.1571      0.1500      0.1700
Summing   :   54.5455      0.2270      0.2200      0.2300
SAXPYing  :   52.1739      0.2371      0.2300      0.2500
 
stream_s
SGI Challenge R4400 os 5.0  ftn 3.10
f77 -O2 -mips2
compiled on Challenge
--------------------------------------
 Single precision appears to have  7 digits of accuracy
 Assuming 4 bytes per default REAL word
--------------------------------------
 Timing calibration ; time =    61.00000     hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   40.0000      0.2061      0.2000      0.2100
Scaling   :   36.3636      0.2281      0.2200      0.2400
Summing   :   41.3793      0.2981      0.2900      0.3100
SAXPYing  :   48.0000      0.2550      0.2500      0.2600
 
stream_s
SGI Challenge R4400 os 5.0  ftn 3.10
f77 -O2 -mips2
compiled on Crimson
--------------------------------------
 Single precision appears to have  7 digits of accuracy
 Assuming 4 bytes per default REAL word
--------------------------------------
 Timing calibration ; time =    62.00000     hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   53.3333      0.1591      0.1500      0.1700
Scaling   :   47.0589      0.1741      0.1700      0.1900
Summing   :   50.0000      0.2440      0.2400      0.2500
SAXPYing  :   48.0000      0.2571      0.2500      0.2700
 
stream_d
SUN SS2 os 4.1.2 f77 SC1.0
f77 -O2 -cg89
--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
Timing calibration ; time =     382.99998268485 hundredths of a second
Increase the size of the arrays if this is <30 
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   20.5129      0.8168      0.7800      1.0100
Scaling   :   18.6047      0.8640      0.8600      0.8900
Summing   :   21.8182      1.1482      1.1000      1.5000
SAXPYing  :   22.4299      1.1078      1.0700      1.3400
 
stream_d
SUN SS10/41 os 4.1.3 f77 SC2.0.1
f77 -O2 -cg92
--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =     238.33334110677 hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   34.2857      0.4868      0.4667      0.5000
Scaling   :   38.4001      0.4267      0.4167      0.4333
Summing   :   36.9231      0.6534      0.6500      0.6667
SAXPYing  :   37.8947      0.6467      0.6333      0.6500
 
stream_s
SUN SS2 os 4.1.2 f77 SC1.0
f77 -O2 -cg89
--------------------------------------
 Single precision appears to have  7 digits of accuracy
 Assuming 4 bytes per default REAL word
--------------------------------------
Timing calibration ; time =     191.000 hundredths of a second
Increase the size of the arrays if this is <30 
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   18.6046      0.4411      0.4300      0.4500
Scaling   :   17.7778      0.4640      0.4500      0.4700
Summing   :   19.3548      0.6270      0.6200      0.6300
SAXPYing  :   20.6897      0.6021      0.5800      0.6100
 
stream_s
SUN SS10/41 os 4.1.3 f77 SC2.0.1
f77 -O2 -cg92
--------------------------------------
 Single precision appears to have  7 digits of accuracy
 Assuming 4 bytes per default REAL word
--------------------------------------
 Timing calibration ; time =     131.667 hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   36.9232      0.2302      0.2167      0.2500
Scaling   :   34.2857      0.2418      0.2333      0.2500
Summing   :   37.8947      0.3301      0.3167      0.3333
SAXPYing  :   36.0000      0.3401      0.3333      0.3500
 
stream_s
RS/6000-580 xlf 2.03
f77 -O 
 Test #1 Failed = picalc=piexact
 Apparently Single=Double Precision
 Proceeding to Test #2
  
--------------------------------------
 Single precision appears to have  7 digits of accuracy
 Assuming 4 bytes per default REAL word
--------------------------------------
 Timing calibration ; time =   67.00000000      hundredths of a second
 Increase the size of the arrays if this is <30  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:  177.7778       .1880       .1800       .1900
Scaling   :  118.5185       .2841       .2700       .2900
Summing   :  137.1429       .3551       .3500       .3800
SAXPYing  :  141.1765       .3551       .3400       .3700
--------------------------------------

stream_d
RS/6000-580 xlf 2.03
f77 -O 
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =   133.000000000000000      hundredths of a second
 Increase the size of the arrays if this is <30  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:  266.6667       .2512       .2400       .2700
Scaling   :  246.1538       .2661       .2600       .2800
Summing   :  234.1463       .4261       .4100       .4500
SAXPYing  :  228.5714       .4352       .4200       .4600

From alan@msc.edu  Tue Apr 20 18:32:54 1993
Received: from noc.msc.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA04401; Tue, 20 Apr 93 18:32:54 -0400
Received: from af.msc.edu by noc.msc.edu (5.65/MSC/v3.0.1(920324))
	id AA18958; Tue, 20 Apr 93 17:36:07 -0500
Received: by af.msc.edu (5.57/MSC/v3.0(901107))
	id AA01835; Tue, 20 Apr 93 17:36:06 -0500
Date: Tue, 20 Apr 93 17:36:06 -0500
From: alan@msc.edu
Message-Id: <9304202236.AA01835@af.msc.edu>
To: mccalpin
Subject: Re: New Chips & Memory Bandwidth
Newsgroups: comp.arch
In-Reply-To: <C5sx2o.Mow@news.udel.edu>
References: <1993Apr15.151349.9383@walter.cray.com>
Organization: Minnesota Supercomputer Center, Inc.
Cc: 
Status: RO

In article <C5sx2o.Mow@news.udel.edu> you write:
<
<Anybody have a CM-5 handy to test? (vector units preferred but not essential)

I tried the benchmark but was unable to get accurate results.  Even
with the largest possible memory size I was unable to get the time
well above the granularity of the system clock (60 ticks/sec).  Also, note
that three other users were also running jobs at the same time.

Note that the CM-5 has 64 bit wide memory, like a Cray.  32 bit memory
accesses require a read-modify-write sequence which takes three times longer.

Also note that all optimizations were turned off, because otherwise the
whole benchmark would have been deleted by the optimizer.

HW: CM-5, 256 SPARCs, 1024 Vector Units, 32 Mhz Clock
SW: CMOST 7.2 Beta 1 Patch 4, CMF 2.0 Beta 1, cmf -g [no optimization]

For stream_s, n=128 million.  For stream_d, n=256 million.  

Script started on Tue Apr 20 17:20:46 1993
e4 2% ./stream_s
-------------------------------------
Single precision appears to have  7 digits of accuracy
Assuming 4 bytes per default REAL word
-------------------------------------
Timing calibration ; time =   21.66667 hundredths of a second
Increase the size of the arrays if this is <30  and your clock precision is =<1
/100 second
---------------------------------------------------
unction     Rate (MB/s)  RMS time   Min time  Max time
ssignment:**********      0.1000      0.1000      0.1000
caling   :**********      0.0953      0.0833      0.1000
umming   :-9203.5098      0.1202      0.1167      0.1333
AXPYing  :**********      0.1138      0.0833      0.1167
e4 3% ./stream_d
-------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLEPRECISION word
-------------------------------------
Timing calibration ; time =   3.333333134651184 hundredths of a second
Increase the size of the arrays if this is <30  and your clock precision is =<1/100 second
---------------------------------------------------
unction     Rate (MB/s)  RMS time   Min time  Max time
ssignment:**********      0.0264      0.0167      0.0333
caling   :**********      0.0279      0.0167      0.0333
umming   :**********      0.0441      0.0333      0.0500
AXPYing  :**********      0.0425      0.0333      0.0500
e4 4% exit
Script ended on Tue Apr 20 17:21:06 1993

Note that 0.0167s is the granularity of the system clock.
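
Every time in the output above is a whole multiple of the 1/60 s clock
tick.  A minimal Python sketch of that quantization follows; the helper
name is hypothetical, and the sample times are copied from the output
above.

```python
TICKS_PER_SECOND = 60  # system clock granularity stated in the message

def to_ticks(seconds, ticks_per_second=TICKS_PER_SECOND):
    """Convert a reported time to the nearest whole number of clock ticks."""
    return round(seconds * ticks_per_second)

# Distinct times that appear in the stream_s and stream_d tables.
reported = [0.0167, 0.0333, 0.0500, 0.0833, 0.1000, 0.1167, 0.1333]

for t in reported:
    ticks = to_ticks(t)
    # An error of one tick is this fraction of the measured time.
    print(f"{t:.4f} s = {ticks} tick(s), one tick is {100 / ticks:.0f}% of it")
```

A minimum time of one tick, as in the stream_d Assignment and Scaling
rows, means an error of a single tick is as large as the measurement
itself.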

From alan@msc.edu  Tue Apr 20 18:36:09 1993
Received: from noc.msc.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA04407; Tue, 20 Apr 93 18:36:09 -0400
Received: from af.msc.edu by noc.msc.edu (5.65/MSC/v3.0.1(920324))
	id AA19197; Tue, 20 Apr 93 17:39:23 -0500
Received: by af.msc.edu (5.57/MSC/v3.0(901107))
	id AA01856; Tue, 20 Apr 93 17:39:22 -0500
Date: Tue, 20 Apr 93 17:39:22 -0500
From: alan@msc.edu
Message-Id: <9304202239.AA01856@af.msc.edu>
To: mccalpin
Subject: oops
Status: RO

The values I gave you for n were erroneous.

For stream_s, n=256 million (not 128 mil).

For stream_d, n=128 million (not 256 mil).

Sorry.

--
Alan E. Klietz
Minnesota Supercomputer Center, Inc.
1200 Washington Avenue South
Minneapolis, MN  55415
Tel: +1 612 337 3520	       Internet: alan@msc.edu
Fax: +1 612 337 3400	

From jgm@doug.Econ.QueensU.CA  Wed Apr 21 10:05:09 1993
Received: from doug.econ.QueensU.CA by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA05564; Wed, 21 Apr 93 10:05:09 -0400
Received: by doug.Econ.QueensU.CA (AIX 3.2/UCB 5.64/4.03)
          id AA03075; Wed, 21 Apr 1993 10:08:43 -0400
Date: Wed, 21 Apr 1993 10:03:06 +22300346 (EDT)
From: "James G. MacKinnon" <jgm@doug.Econ.QueensU.CA>
Subject: streams benchmark
To: John McCalpin <mccalpin>
Message-Id: <Pine.3.05.9304211006.A12031-c100000@doug>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: RO

I tried your streams benchmark on my new RS/6000 355, using xlf 2.3 with
-O3 (when I used the preprocessors, they apparently optimized everything
away). Here are the results:

out.IBM_355_d (n=1000000)

--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =   46.0000008344650269      hundredths of a second
 Increase the size of the arrays if this is <30  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:  133.3335       .1210       .1200       .1300
Scaling   :  133.3335       .1281       .1200       .1300
Summing   :  114.2860       .2100       .2100       .2100
SAXPYing  :  120.0001       .2061       .2000       .2100


out.IBM_355_s (n=2000000)

--------------------------------------
 Single precision appears to have  7 digits of accuracy
 Assuming 4 bytes per default REAL word
--------------------------------------
 Timing calibration ; time =   56.00000000      hundredths of a second
 Increase the size of the arrays if this is <30  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:  100.0000       .1641       .1600       .1700
Scaling   :   84.2105       .1951       .1900       .2000
Summing   :   85.7144       .2830       .2800       .2900
SAXPYing  :   85.7144       .2800       .2800       .2800

On this benchmark, my rather inexpensive 355 seems to compare very well
with much more expensive machines from Sun and SGI.

**********************************************************************
 
James G. MacKinnon                           Department of Economics
   phone: 613 545-2293                       Queen's University
     Fax: 613 545-6668                       Kingston, Ontario, Canada
Internet: jgm@doug.econ.queensu.ca           K7L 3N6
 

From jfc@Athena.MIT.EDU  Wed Apr 21 10:24:07 1993
Received: from ACHATES.MIT.EDU by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA05619; Wed, 21 Apr 93 10:24:07 -0400
Received: by Achates.MIT.EDU (5.61) id AA25108; Wed, 21 Apr 93 10:27:24 -0400
Message-Id: <9304211427.AA25108@Achates.MIT.EDU>
To: mccalpin
Subject: stream program result
Date: Wed, 21 Apr 1993 10:27:23 EDT
From: John Carr <jfc@Athena.MIT.EDU>
Status: RO


On a VAX 9000/420 (62.5 MHz clock) running Ultrix 4.2, compiled with
fort -O -V vector, parameter N changed to 1200000 for more accurate
results.  The test only used 1 processor.


-------------------------------------
Double precision appears to have 17 digits of accuracy
Assuming 8 bytes per DOUBLEPRECISION word
-------------------------------------
Timing calibration ; time =    239.9999871850014     hundredths of a second
Increase the size of the arrays if this is <30 
 and your clock precision is =<1/100 second
---------------------------------------------------
unction     Rate (MB/s)  RMS time   Min time  Max time
ssignment:  144.0001      0.1486      0.1333      0.1667
caling   :  164.5719      0.1391      0.1167      0.1667
umming   :  172.7997      0.1972      0.1667      0.2167
AXPYing  :  157.0913      0.1918      0.1833      0.2000


-------------------------------------
Single precision appears to have  7 digits of accuracy
Assuming 4 bytes per default REAL word
-------------------------------------
Timing calibration ; time =    126.6667     hundredths of a second
Increase the size of the arrays if this is <30 
 and your clock precision is =<1/100 second
---------------------------------------------------
unction     Rate (MB/s)  RMS time   Min time  Max time
ssignment:   82.2860      0.1269      0.1167      0.1333
caling   :   82.2857      0.1335      0.1167      0.1500
umming   :   78.5455      0.2020      0.1833      0.2333
AXPYing  :   78.5455      0.2036      0.1833      0.2167

From jgm@doug.Econ.QueensU.CA  Wed Apr 21 11:14:13 1993
Received: from doug.econ.QueensU.CA by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA05742; Wed, 21 Apr 93 11:14:13 -0400
Received: by doug.Econ.QueensU.CA (AIX 3.2/UCB 5.64/4.03)
          id AA11865; Wed, 21 Apr 1993 11:18:12 -0400
Date: Wed, 21 Apr 1993 11:13:46 +22300346 (EDT)
From: "James G. MacKinnon" <jgm@doug.Econ.QueensU.CA>
Subject: Re: streams benchmark
To: "John D. McCalpin" <mccalpin>
In-Reply-To: <9304211412.AA05575@perelandra.cms.udel.edu>
Message-Id: <Pine.3.05.9304211146.A3157-a100000@doug>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: RO

I believe the 355 runs at the strange speed that was introduced with the
550 and also used by the 350. Some of my literature from IBM calls it 41 MHz
and some calls it 42. I think it is actually something like 41.7.

The main differences between a 355 and a 350 are the larger instruction
cache for the former (32K vs 8K) and the greater memory and slots of the
latter (the 355 only has one memory slot, for 128MB maximum, and two Micro
Channel slots, one of which is filled by a GT3i graphics adapter). The 365
and 375 are like the 355, but run at 50 and 62.5 MHz, respectively.

**********************************************************************
 
James G. MacKinnon                           Department of Economics
   phone: 613 545-2293                       Queen's University
     Fax: 613 545-6668                       Kingston, Ontario, Canada
Internet: jgm@doug.econ.queensu.ca           K7L 3N6
 

From alan@msc.edu  Wed Apr 21 15:25:47 1993
Received: from noc.msc.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA07169; Wed, 21 Apr 93 15:25:47 -0400
Received: from af.msc.edu by noc.msc.edu (5.65/MSC/v3.0.1(920324))
	id AA29421; Wed, 21 Apr 93 13:58:29 -0500
Received: by af.msc.edu (5.57/MSC/v3.0(901107))
	id AA02238; Wed, 21 Apr 93 13:58:29 -0500
Date: Wed, 21 Apr 93 13:58:29 -0500
From: alan@msc.edu
Message-Id: <9304211858.AA02238@af.msc.edu>
To: mccalpin
Subject: Re: New Chips & Memory Bandwidth
Status: RO

>>Assignment:**********      0.0264      0.0167      0.0333
>>Scaling   :**********      0.0279      0.0167      0.0333
>>Summing   :**********      0.0441      0.0333      0.0500
>>SAXPYing  :**********      0.0425      0.0333      0.0500
>
>128 MB/(0.0264 s) = 4848 MB/s = 19 MB/s/cpu = 1/25 of peak
>Not so good for a first cut?

n=128 million (elements), but each element is 8 bytes wide, so 
it should be 19*8 = 152 MB/s/cpu.
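
As a rough check of the arithmetic above, here is a minimal Python
sketch.  The function name is hypothetical, "128 million" is taken as
128e6, and the divisor of 256 CPUs is assumed from the hardware line of
the original report.  Like the quoted estimate, it counts one array's
worth of data.

```python
def per_cpu_rate_mb_s(n_elements, bytes_per_element, seconds, n_cpus):
    """Data moved per second per CPU, in MB/s (1 MB = 1e6 bytes)."""
    total_mb = n_elements * bytes_per_element / 1.0e6
    return total_mb / seconds / n_cpus

n = 128_000_000   # assumed reading of "128 million" elements
t = 0.0264        # RMS time of the Assignment row, in seconds
cpus = 256        # assumed from "256 SPARCs"

as_quoted = per_cpu_rate_mb_s(n, 1, t, cpus)   # treats n as a byte count
corrected = per_cpu_rate_mb_s(n, 8, t, cpus)   # 8 bytes per element

print(f"as quoted: {as_quoted:.1f} MB/s/cpu")
print(f"corrected: {corrected:.1f} MB/s/cpu")
```

The two values come out near 19 and 152 MB/s/cpu, matching the figures
in the exchange.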

>Of course, it is impossible to say anything very intelligent given
>these numbers.  Hopefully the TMC folks can manage something a bit
>more concrete.

Agreed.

From keith@earth.ox.ac.uk  Thu Apr 22 10:57:05 1993
Received: from eeyore.earth.ox.ac.uk by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA09931; Thu, 22 Apr 93 10:57:05 -0400
From: Keith Refson <keith@earth.ox.ac.uk>
Received: from rahman.earth.earth.ox.ac.uk (rahman.earth.ox.ac.uk) by earth.ox.ac.uk; Thu, 22 Apr 93 16:00:23 BST
Date: Thu, 22 Apr 93 16:00:23 BST
Message-Id: <18135.9304221500@rahman.earth.earth.ox.ac.uk>
To: mccalpin
Subject: Memory bandwidth tests
Status: RO

Dear Prof. McCalpin,

I am a little puzzled by the results of your "bandwidth" benchmark, and I
wonder if you could explain them to me.  In a nutshell, I don't see how
you get the memory bandwidth figures from the results of running your 
program.  

I have some money to invest in a machine to do serious numerical
calculations which are pretty memory-intensive and so I am keenly
interested in the results.  I would also be very interested in your
opinions of the various competitors.  I am looking at IBM, HP, DEC Alpha,
and SGI Challenge machines.

You may be interested in my benchmarking results for these and other
machines.  The program is a MD simulation code written by me in C.
The comparison of the "small", ie in-cache runs with the "big" 30MB
ones is *very* revealing and does sort out the sheep from the goats.
If so, just get the file "/pub/benchmark.tex" by anonymous ftp from
earth.ox.ac.uk (or eeyore.earth.ox.ac.uk or 163.1.22.1 -- we changed
our DNS records recently and it may not have propagated yet).

In any case I have some results from running your benchmark for
DEC Alpha, HP 755/735, and Stardent Titan P3. My Titan results differ
substantially from those you have reported -- I don't know why.
But my results agree well with the theoretical bus bandwidth
of 256 MB/sec.

sincerely

Keith Refson

--------------------------------------------------
Here are the results.


HP 755/735 

feynman 22: ./a.out
--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =  21.00000046193599 hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   68.5715       .0700       .0700       .0700  
Scaling   :   68.5715       .0700       .0700       .0700  
Summing   :   72.0001       .1021       .1000       .1100  
SAXPYing  :   80.0001       .0971       .0900       .1000  


DEC 3000/500 (150MHz AXP Alpha)

axpbb% ./a.out
--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =    16.49440079927444     hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:  100.3680      0.0485      0.0478      0.0488
Scaling   :   96.4323      0.0499      0.0498      0.0508
Summing   :   98.3607      0.0736      0.0732      0.0742
SAXPYing  :   99.6900      0.0730      0.0722      0.0732

Kubota (ex-Stardent) Titan P3, 1 processor

zebedee% a.out
--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =    37.00000196695328     hundredths of a second
 Increase the size of the arrays if this is <30  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:  240.0002      0.0265      0.0200      0.0300
Scaling   :  240.0002      0.0283      0.0200      0.0300
Summing   :  239.9993      0.0391      0.0300      0.0400
SAXPYing  :  144.0001      0.0522      0.0500      0.0600

Kubota (ex-Stardent) Titan P3, 4 processors

zebedee% a.out
--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =    79.99998927116394     hundredths of a second
 Increase the size of the arrays if this is <30  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:  480.0005      0.0228      0.0100      0.0300
Scaling   :  240.0002      0.0349      0.0200      0.0900
Summing   :  240.0002      0.0363      0.0300      0.0400
SAXPYing  :  240.0002      0.0449      0.0300      0.0600


From mash@mash.wpd.sgi.com  Fri Apr 23 11:45:34 1993
Received: from SGI.COM by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA14952; Fri, 23 Apr 93 11:45:34 -0400
Received: from [192.26.61.16] by sgi.sgi.com via SMTP (920330.SGI/910110.SGI)
	for mccalpin@perelandra.cms.udel.edu id AA28899; Fri, 23 Apr 93 08:48:56 -0700
Received: by mash.wpd.sgi.com (920330.SGI/911001.SGI)
	for @sgi.sgi.com:mccalpin@perelandra.cms.udel.edu id AA16415; Fri, 23 Apr 93 08:48:55 -0700
From: mash@mash.wpd.sgi.com (John R. Mashey)
Message-Id: <9304230848.ZM16413@mash.wpd.sgi.com>
Date: Fri, 23 Apr 1993 08:48:55 -0700
In-Reply-To: "John D. McCalpin" <mccalpin@perelandra.cms.udel.edu>
        "Re: CPU Speed vs. Memory Bandwidth" (Apr 22, 10:39pm)
References: <8735@fury.BOEING.COM> 
	<9304230239.AA12899@perelandra.cms.udel.edu>
X-Mailer: Z-Mail (2.1.0 10/1/92)
To: "John D. McCalpin" <mccalpin>
Subject: Re: CPU Speed vs. Memory Bandwidth
Status: RO

On Apr 22, 10:39pm, "John D. McCalpin" wrote:
> Subject: Re: CPU Speed vs. Memory Bandwidth
> In article <C5wvxM.M24@odin.corp.sgi.com> you write:
> >For example, each memory board in a Challenge supports two-way interleaving:
> >each leaf can start a new read request every 200ns, but the system bus can
deliver request 2X faster.  For example, if you had an infinitely fast CPU
Key phrase: -------------------------------------------^^^^^^^^^^^^^^^^^^^
(Current R4400 CPUs are not infinitely fast :-) The wording was carefully
chosen!

> latencies) gives the observed 60 MB/s limit.
>
> Is it supposed to be different?
Sounds about right.
	As you know:
	1) current R4400s have 2-level caches, and relatively little overlap.
	2) 150 MHz ones are faster, but not different.
	3) TFPs have 1-level caches (as far as FP goes) and more overlap.
	4) and T5s  will have much more overlap, more prefetching, etc.


From lamaster@george.arc.nasa.gov  Fri Apr 23 11:47:40 1993
Received: from george.arc.nasa.gov by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA14986; Fri, 23 Apr 93 11:47:40 -0400
Received: by george.arc.nasa.gov (4.1/1.35)
	id AA15325; Fri, 23 Apr 93 08:51:04 PDT
Date: Fri, 23 Apr 93 08:51:04 PDT
From: lamaster@george.arc.nasa.gov (Hugh LaMaster -- RCS)
Message-Id: <9304231551.AA15325@george.arc.nasa.gov>
To: mccalpin
Subject: Re: CPU Speed vs. Memory Bandwidth
Status: RO

>Except that one of the tests is simply a copy, which should not
>require any FPU intervention.

I guess I bungled my attempt to express what you just wrote.
Sorry about that.  "Compute" was a poor choice of words intended
to express that we are looking at CPU load/store here, plus maybe
FP operations, but not the raw speed of the bus.  John Mashey's
post is typical of what you get from a vendor -- the guaranteed
not-to-exceed speed of the bus.  However, in the case of the SGI,
only a small fraction is available to an application on any single CPU.
I admit that I am disappointed by the SGI numbers.  There seems to
be a bit of a cache bottleneck there.  The DEC 3000/500 looks better
and, I am told, has better random access speed than IBM.

I hope you don't mind my posting your results.  I couldn't resist
the opportunity to bring up my favorite subject while responding
to someone else.

>
>>Does anyone have numbers for the new DEC AXP systems, and new HP 
>>7100-based systems?  Can someone from DEC or HP post results?
>
>I have received a number of new results lately -- check the archive.
>These include SGI Challenge, SGI Crimson, DEC 3000/500, HP 9000/755,
>and IBM RS/6000-580.  The results are very interesting....


I had fetched the previous week's table, which had only the SGI and IBM.  
I just fetched the latest and posted the HP and DEC numbers as a 
supplement (I hope before 20 other people do as well.)  Thanks!


Regards,
Hugh LaMaster


From mash@mash.wpd.sgi.com  Fri Apr 23 18:06:14 1993
Received: from SGI.COM by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA17432; Fri, 23 Apr 93 18:06:14 -0400
Received: from [192.26.61.16] by sgi.sgi.com via SMTP (920330.SGI/910110.SGI)
	for mccalpin@perelandra.cms.udel.edu id AA28729; Fri, 23 Apr 93 15:09:39 -0700
Received: by mash.wpd.sgi.com (920330.SGI/911001.SGI)
	for @sgi.sgi.com:mccalpin@perelandra.cms.udel.edu id AA16850; Fri, 23 Apr 93 15:09:37 -0700
From: mash@mash.wpd.sgi.com (John R. Mashey)
Message-Id: <9304231509.ZM16848@mash.wpd.sgi.com>
Date: Fri, 23 Apr 1993 15:09:36 -0700
In-Reply-To: "John D. McCalpin" <mccalpin@perelandra.cms.udel.edu>
        "Re: CPU Speed vs. Memory Bandwidth" (Apr 23, 11:53am)
References: <8735@fury.BOEING.COM> 
	<9304230239.AA12899@perelandra.cms.udel.edu> 
	<9304231553.AA14994@perelandra.cms.udel.edu>
X-Mailer: Z-Mail (2.1.0 10/1/92)
To: "John D. McCalpin" <mccalpin>
Subject: Re: CPU Speed vs. Memory Bandwidth
Status: RO

On Apr 23, 11:53am, "John D. McCalpin" wrote:
> Subject: Re: CPU Speed vs. Memory Bandwidth
> >                            For example, if you had an infinitely fast CPU
> >Key phrase: -------------------------------------------^^^^^^^^^^^^^^^^^^^
> >(Current R4400 CPUs are not infinitely fast :-) The wording was carefully
> >chosen!
>
> The bottleneck here is definitely not in the CPU, but in the cache
Sorry: when I said CPU in this context, I meant the CPU subsystem up to
the memory interface,(as opposed to the memory side of the interface).
This may be wrong thinking on my part, but I do it because there are
so many different partitionings of the CPU subsystem.  After all,
for an R4400, all of the cache control is part of the CPU chip itself...


> It is interesting to hear that the TFP machines will not have level 2
> cache for the FP operands.  The experience of Sun and SGI seems to be
> that 2-level caches are a risky proposition, performance-wise.

Well, yes, and no, depending on what people do.
For what *you* do, TFPs will be a much better match.
For many things, I'm quite happy with 2-level caches.
(In fact, TFP has 2-level caches for the code and integer data, 1 level for FP
data, i.e., it goes like this:

CPU + I-cache + D-cache (1 chip)   -------  |
					    scache
FPU				   -------  |

that is, the FPU has direct access to multi-MB 4-set-associative cache,
and the whole CPU+FP complex can do 4 of:
	2 64-bit loads/stores, including index FP load/stores
	2 FP Mult-adds (or other FP ops)
	2 integer operations
per cycle

and quite obviously, this won't help integer performance much directly,
i.e., not a lot faster than 150 MHz R4400 ... but it will certainly
help your apps.  Also, the interface from system bus <-> cache is 128
bits rather than the 64 of the R4400s.


From gklaass@nexus.yorku.ca  Wed Apr 28 12:21:52 1993
Received: from nexus.yorku.ca by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA01745; Wed, 28 Apr 93 12:21:52 -0400
Received: by nexus.yorku.ca id <9223>; Wed, 28 Apr 1993 12:25:29 -0400
From: Gary Klaassen <gklaass@nexus.yorku.ca>
To: mccalpin
Subject: Re: SPECint92, and other benchmarks (really LINPACK)
Newsgroups: comp.sys.sgi.hardware,comp.benchmarks
References: <1993Apr13.225551.26609@texhrc.uucp> <gklaass.734819791@yorku.ca> <1rlmriINNq9q@usenet.pa.dec.com> <C6748w.H06@news.udel.edu>
Message-Id: <93Apr28.122529edt.9223@nexus.yorku.ca>
Date: 	Wed, 28 Apr 1993 12:25:16 -0400
Status: RO

John: 

>>In article <gklaass.734819791@yorku.ca> gklaass@nexus.yorku.ca (Gary Klaassen) writes:
>>>Several others have pointed out that TFP Power Challenge will
>>>incorporate a streaming cache capable of supplying data to the
>>>floating point units at a rate of 21.6 GBytes per second. This is
>>>presumably the rate between the cache and FPU, but what about the rate
>>>between memory and cache? 

You replied:
>There are two levels of cache in the current machine (the "Challenge"),
>but I am told that the "Power Challenge" will bypass the first-level 
>cache for FP operands.

This jibes with what John Mashey told me. The Power Challenge Icache will
be 32KB, the Dcache is 16KB, and the secondary external cache will
have 2-16MB and a direct line to the FPU.

>I do not know the exact specs for the primary cache interface on the
>Challenge machines, except that the cache line is 16 bytes.   Assuming

There are 3 caches, 16KB Instruction, 16KB data and 1MB external.
The field engineer told me they have 10ns RAM.

>a second-level cache hit has a latency of 3 cycles (chosen out of a 
>nearby hat) and a 128-bit wide interface, then the transfer time is 
>4 cycles, and the bandwidth is 300 MB/s.  Since the transfer time 
>is much smaller than the latency, this throughput estimate is roughly
>a linear function of the latency.   Any better numbers from SGI?


>The traffic between secondary cache and main memory goes over the main
>system bus with a peak throughput of 1.2 GB/s.  Unfortunately, a single cpu
>can only get about 1/20 of this performance via the cache interface.  The
>problem is that the latency is about 55-60 cycles (at 75 MHz).  The
>secondary cache lines are 128 bytes, and the bus width is 256 bits.  So the
>cache miss time is 60 cycles latency plus 4 cycles for the actual data
>transfer.   This gives a peak throughput of about 90 MB/s  (note that the
>bus clock is about 48 MHz, while the cpu external clock is 75 MHz).

Yeah, tell me about it. We bought one of these Challenge L's only
to discover its memory throughput is more or less the same as an Indigo.
This is of course because the memory subsystem is basically the same
as an Indigo (some differences for SMP). For some reason they neglected
to mention this. Moral: don't trust marketing numbers.
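
The latency-dominated estimate quoted above (one 128-byte line per miss,
60 cycles of latency plus 4 cycles of transfer) can be sketched as
follows.  This is only an illustration: the function name is
hypothetical, and the quoted text does not fully spell out which clock
the 64 cycles are counted at, so both the 48 MHz bus clock and the
75 MHz cpu external clock are shown.

```python
def miss_limited_mb_s(line_bytes, latency_cycles, transfer_cycles, clock_hz):
    """Throughput when each cache line costs latency + transfer cycles."""
    seconds_per_line = (latency_cycles + transfer_cycles) / clock_hz
    return line_bytes / seconds_per_line / 1.0e6

# Assumption: all 64 cycles counted at the bus clock.
at_bus_clock = miss_limited_mb_s(128, 60, 4, 48e6)
# Assumption: all 64 cycles counted at the cpu external clock.
at_cpu_clock = miss_limited_mb_s(128, 60, 4, 75e6)

print(f"at 48 MHz: {at_bus_clock:.0f} MB/s")
print(f"at 75 MHz: {at_cpu_clock:.0f} MB/s")
```

Counting at the bus clock gives a figure close to the quoted "about
90 MB/s"; either way the result is a small fraction of the 1.2 GB/s bus
peak.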

>Observed throughput from my "STREAM" benchmark shows about 60 MB/s on
>a single-cpu system.   Results are available by anonymous ftp to 
>perelandra.cms.udel.edu in bench/stream/.

I am curious as to why the IBM 580 numbers are so much higher? How much
of this is due to the memory bandwidth and how much is due to the 
superscalar nature of the RS6000 cpu? The benchmarks you chose are well
suited to RS6000 superscalar and would disadvantage some of the other
chips in the table.

>Since this throughput is dominated by latency, presumably faster cache
>controller hardware in future revisions will run faster.  The TFP chip
>will be able to cut the latency immediately by bypassing the first-
>level cache for FP operands.

I certainly hope so.

Regards, Gary

Prof. G. P. Klaassen
Dept. of Earth and Atmospheric Science
York University, North York, Ontario, Canada  M3J 1P3
Email: gklaass@nexus.yorku.ca

From rsw@decade.maths.unsw.EDU.AU  Thu Apr 29 19:35:36 1993
Received: from bach.udel.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA07375; Thu, 29 Apr 93 19:35:36 -0400
Received: from decade.maths.unsw.EDU.AU ([149.171.180.5]) by bach.udel.edu with SMTP
	(5.65c/IDA-1.2.8) id AA24874; Thu, 29 Apr 1993 19:39:16 -0400
Received: by decade.maths.unsw.EDU.AU (5.65/1.35)
	id AA05823; Fri, 30 Apr 1993 09:40:33 +1000
Date: Fri, 30 Apr 1993 09:40:33 +1000
From: rsw@decade.maths.unsw.EDU.AU
Message-Id: <9304292340.AA05823@decade.maths.unsw.EDU.AU>
To: mccalpin@bach.udel.edu
Subject: Stream on CM5
Status: RO


I have a modified version of your stream benchmark running on a CM5
and results for a 16 PN partition.

Happy to let you have program and results if someone else has not
already provided.

In double precision the SAXPY results are about 7 GB/s (theoretical
peak of 8 GB/s with 16 PNs). Single precision is much worse.

Dr. Rob Womersley                       E-mail: R.Womersley@unsw.edu.au
School of Mathematics                           rsw@hydra.maths.unsw.edu.au
University of New South Wales           Phone:  61 - 2 - 697-2998
P.O. Box 1, Kensington NSW 2033         Fax:    61 - 2 - 662-6445
AUSTRALIA

From rsw@hydra.maths.unsw.EDU.AU  Thu Apr 29 21:25:22 1993
Received: from hydra.maths.unsw.EDU.AU by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA07503; Thu, 29 Apr 93 21:25:22 -0400
Received: by hydra.maths.unsw.EDU.AU (5.61/1.35)
	id AA00588; Fri, 30 Apr 93 11:30:31 +1000
Date: Fri, 30 Apr 93 11:30:31 +1000
From: rsw@hydra.maths.unsw.EDU.AU
Message-Id: <9304300130.AA00588@hydra.maths.unsw.EDU.AU>
To: mccalpin
Subject: Re:  Stream on CM5
Status: RO

* Program: Stream
* Programmer: John D. McCalpin
* Revision: 2.0, September 30,1991
*
* CM5 Data Parallel version by Rob Womersley  19 April, 1993
*
* This program measures memory transfer rates in MB/s for simple
* computational kernels coded in Fortran.  These numbers reveal the
* quality of code generation for simple uncacheable kernels as well
* as showing the cost of floating-point operations relative to memory
* accesses.
*
* INSTRUCTIONS:
*       1) Stream requires a cpu timing function called second().
*          A sample is shown below.  This is unfortunately rather
*          system dependent.  It helps to know the granularity of the
*          timing.  The code below assumes that the granularity is
*          1/100 seconds.
*       2) Stream requires a good bit of memory to run.
*          Adjust the Parameter 'N' in the second line of the main
*          program to give a 'timing calibration' of at least 20 clicks.
*          This will provide rate estimates that should be good to
*          about 5% precision.
*       3) Compile the code with full optimization.  Many compilers
*          generate unreasonably bad code before the optimizer tightens
*          things up.  If the results are unreasonably good, on the
*          other hand, the optimizer might be too smart for me!
*       4) Mail the results to mccalpin@perelandra.cms.udel.edu
*          Be sure to include:
*               a) computer hardware model number and software revision
*               b) the compiler flags
*               c) all of the output from the test case.
*
* Thanks!
*
      PROGRAM stream
C     .. Parameters ..
      INTEGER, PARAMETER ::  n = 5000000, ntimes = 20
C     ..
C     .. Local Scalars ..
      DOUBLE PRECISION t, t0
      INTEGER j, k, nbpw, nvu
C     ..
C     .. Local Arrays ..
      DOUBLE PRECISION, ARRAY(n) ::  a, b, c
CMF$  LAYOUT a(:news), b(:news), c(:news)

      DOUBLE PRECISION, ARRAY(4) :: maxtime,mintime,rmstime
      DOUBLE PRECISION, ARRAY(4, ntimes) :: times

      INTEGER bytes(4)
      CHARACTER label(4)*12
C     ..
C     .. External Functions ..
      INTEGER CMF_number_of_processors
      DOUBLE PRECISION CM_timer_read_cm_busy, CM_timer_read_cm_idle
      DOUBLE PRECISION CM_timer_read_elapsed
      EXTERNAL CMF_number_of_processors
      EXTERNAL CM_timer_read_cm_busy, CM_timer_read_cm_idle
      EXTERNAL CM_timer_read_elapsed

      INTEGER realsize
      EXTERNAL realsize

C     ..
C     .. Intrinsic Functions ..
      INTRINSIC dble,max,min,sqrt
C     ..
C     .. Data statements ..
      DATA label/' Assignment:',' Scaling   :',' Summing   :',
     $     ' SAXPYing  :'/
      DATA bytes/2,2,3,3/
C     ..

*       --- SETUP --- determine precision and check timing ---

      PRINT *,'STREAM: Measure memory transfer rates in MB/s'
      PRINT *,'for simple computational kernels in Fortran'
      PRINT *
      PRINT *,'CALL CMF_describe_array(a)'
      CALL CMF_describe_array(a)

      PRINT *
      nvu = CMF_number_of_processors()
      WRITE(*,'(/1x,A,I2,A,I2,A/)') 'CM5 with partition of ',nvu/4,
     $     ' processors ( ',nvu,' vector units )'

      nbpw = realsize()

      CALL CM_timer_clear(0)
      CALL CM_timer_start(0)
      a = 1.0D0
      b = 2.0D0
      c = 0.0D0
      CALL CM_timer_stop(0)
      t = CM_timer_read_elapsed(0)
      PRINT *
      PRINT *,'Vector length = ', n
      PRINT *,'Timing calibration: Time = ',t*100,' hundredths',
     $  ' of a second'
      PRINT *,'Increase the size of the arrays if this is < 30'
      PRINT *,'and your clock precision is =< 1/100 second'

*       --- MAIN LOOP --- repeat test cases NTIMES times ---
      DO 60 k = 1, ntimes

          CALL CM_timer_clear(1)
          CALL CM_timer_start(1)
          c = a
          CALL CM_timer_stop(1)
          t = CM_timer_read_elapsed(1)
          times(1,k) = t

          CALL CM_timer_clear(2)
          CALL CM_timer_start(2)
          c = 3.0D0 * a
          CALL CM_timer_stop(2)
          t = CM_timer_read_elapsed(2)
          times(2,k) = t

          CALL CM_timer_clear(3)
          CALL CM_timer_start(3)
          c = a + b
          CALL CM_timer_stop(3)
          t = CM_timer_read_elapsed(3)
          times(3,k) = t

          CALL CM_timer_clear(4)
          CALL CM_timer_start(4)
          c = a + 3.0D0 * b
          CALL CM_timer_stop(4)
          t = CM_timer_read_elapsed(4)
          times(4,k) = t

   60 CONTINUE

*       --- SUMMARY ---
      rmstime = SUM(times**2, DIM=2)
      rmstime = SQRT( rmstime/dble(ntimes) )
      mintime = MINVAL(times, DIM=2)
      maxtime = MAXVAL(times, DIM=2)
      WRITE (*,FMT=9000)
      DO 90 j = 1,4
          WRITE (*,FMT=9010) label(j),n*bytes(j)*nbpw/mintime(j)/1.0D6,
     $      rmstime(j),mintime(j),maxtime(j)
   90 CONTINUE

 9000 FORMAT (/1x, 57('-'),/,' Function  :',1x,
     $        'Rate (MB/s)  RMS time    Min time    Max time')
 9010 FORMAT (a,4(f10.4,2x))
      END


*-------------------------------------
* INTEGER FUNCTION realsize()
*
* A semi-portable way to determine the precision of DOUBLEPRECISION
* in Fortran.
* Here used to guess how many bytes of storage a DOUBLEPRECISION 
* number occupies.
*
      INTEGER FUNCTION realsize()

C     .. Local Scalars ..
      DOUBLE PRECISION result,test
      INTEGER j,ndigits
C     ..
C     .. Local Arrays ..
      DOUBLE PRECISION ref(30)
C     ..
C     .. External Subroutines ..
      EXTERNAL dummy
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC abs,acos,log10,sqrt
C     ..

C       Test #1 - compare single(1.0d0+delta) to 1.0d0

   10 DO 20 j = 1,30
          ref(j) = 1.0d0 + 10.0d0**(-j)
   20 CONTINUE

      DO 30 j = 1,30
          test = ref(j)
          ndigits = j
          CALL dummy(test,result)
          IF (test.EQ.1.0D0) THEN
              GO TO 50
          END IF
   30 CONTINUE
      GOTO 60

   50 WRITE (*,FMT='(a)') ' --------------------------------------'
      WRITE (*,FMT='(1x,a,i2,a)') 'Double precision appears to have ',
     $  ndigits,' digits of accuracy'
      IF (ndigits.LE.8) THEN
          realsize = 4
      ELSE
          realsize = 8
      END IF
      WRITE (*,FMT='(1x,a,i1,a)') 'Assuming ',realsize,
     $  ' bytes per DOUBLEPRECISION word'
      WRITE (*,FMT='(a)') ' --------------------------------------'
      RETURN

   60 PRINT *,' Hmmmm.  I am unable to determine the size of a REAL'
      PRINT *,' Please enter the number of Bytes per DOUBLEPRECISION',
     $  ' number : '
      READ (*,FMT=*) realsize
      IF (realsize.NE.4 .AND. realsize.NE.8) THEN
          PRINT *,' Your answer ',realsize,' does not make sense!'
          PRINT *,' Try again!'
          PRINT *,' Please enter the number of Bytes per ',
     $      'REAL number : '
          READ (*,FMT=*) realsize
      END IF
      PRINT *,'You have manually entered a size of ',realsize,
     $  ' bytes per REAL number'
      WRITE (*,FMT='(a)') '--------------------------------------'
      END

      SUBROUTINE dummy(q,r)
C     .. Scalar Arguments ..
      DOUBLE PRECISION q,r
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC cos
C     ..
      r = cos(q)
      RETURN
      END

From rsw@hydra.maths.unsw.EDU.AU  Thu Apr 29 21:29:24 1993
Received: from hydra.maths.unsw.EDU.AU by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA07508; Thu, 29 Apr 93 21:29:24 -0400
Received: by hydra.maths.unsw.EDU.AU (5.61/1.35)
	id AA01613; Fri, 30 Apr 93 11:34:33 +1000
Date: Fri, 30 Apr 93 11:34:33 +1000
From: rsw@hydra.maths.unsw.EDU.AU
Message-Id: <9304300134.AA01613@hydra.maths.unsw.EDU.AU>
To: mccalpin
Subject: CM5 stream_d.fcm
Status: RO

* Program: Stream
* Programmer: John D. McCalpin
* Revision: 2.0, September 30,1991
*
* CM5 Data Parallel version by Rob Womersley  19 April, 1993
*
* This program measures memory transfer rates in MB/s for simple
* computational kernels coded in Fortran.  These numbers reveal the
* quality of code generation for simple uncacheable kernels as well
* as showing the cost of floating-point operations relative to memory
* accesses.
*
* INSTRUCTIONS:
*       1) Stream requires a cpu timing function called second().
*          A sample is shown below.  This is unfortunately rather
*          system dependent.  It helps to know the granularity of the
*          timing.  The code below assumes that the granularity is
*          1/100 seconds.
*       2) Stream requires a good bit of memory to run.
*          Adjust the Parameter 'N' in the second line of the main
*          program to give a 'timing calibration' of at least 20 clicks.
*          This will provide rate estimates that should be good to
*          about 5% precision.
*       3) Compile the code with full optimization.  Many compilers
*          generate unreasonably bad code before the optimizer tightens
*          things up.  If the results are unreasonably good, on the
*          other hand, the optimizer might be too smart for me!
*       4) Mail the results to mccalpin@perelandra.cms.udel.edu
*          Be sure to include:
*               a) computer hardware model number and software revision
*               b) the compiler flags
*               c) all of the output from the test case.
*
* Thanks!
*
      PROGRAM stream
C     .. Parameters ..
      INTEGER, PARAMETER ::  n = 5000000, ntimes = 20
C     ..
C     .. Local Scalars ..
      DOUBLE PRECISION t, t0
      INTEGER j, k, nbpw, nvu
C     ..
C     .. Local Arrays ..
      DOUBLE PRECISION, ARRAY(n) ::  a, b, c
CMF$  LAYOUT a(:news), b(:news), c(:news)

      DOUBLE PRECISION, ARRAY(4) :: maxtime,mintime,rmstime
      DOUBLE PRECISION, ARRAY(4, ntimes) :: times

      INTEGER bytes(4)
      CHARACTER label(4)*12
C     ..
C     .. External Functions ..
      INTEGER CMF_number_of_processors
      DOUBLE PRECISION CM_timer_read_cm_busy, CM_timer_read_cm_idle
      DOUBLE PRECISION CM_timer_read_elapsed
      EXTERNAL CMF_number_of_processors
      EXTERNAL CM_timer_read_cm_busy, CM_timer_read_cm_idle
      EXTERNAL CM_timer_read_elapsed

      INTEGER realsize
      EXTERNAL realsize

C     ..
C     .. Intrinsic Functions ..
      INTRINSIC dble,max,min,sqrt
C     ..
C     .. Data statements ..
      DATA label/' Assignment:',' Scaling   :',' Summing   :',
     $     ' SAXPYing  :'/
      DATA bytes/2,2,3,3/
C     ..

*       --- SETUP --- determine precision and check timing ---

      PRINT *,'STREAM: Measure memory transfer rates in MB/s'
      PRINT *,'for simple computational kernels in Fortran'
      PRINT *
      PRINT *,'CALL CMF_describe_array(a)'
      CALL CMF_describe_array(a)

      PRINT *
      nvu = CMF_number_of_processors()
      WRITE(*,'(/1x,A,I2,A,I2,A/)') 'CM5 with partition of ',nvu/4,
     $     ' processors ( ',nvu,' vector units )'

      nbpw = realsize()

      CALL CM_timer_clear(0)
      CALL CM_timer_start(0)
      a = 1.0D0
      b = 2.0D0
      c = 0.0D0
      CALL CM_timer_stop(0)
      t = CM_timer_read_elapsed(0)
      PRINT *
      PRINT *,'Vector length = ', n
      PRINT *,'Timing calibration: Time = ',t*100,' hundredths',
     $  ' of a second'
      PRINT *,'Increase the size of the arrays if this is < 30'
      PRINT *,'and your clock precision is =< 1/100 second'

*       --- MAIN LOOP --- repeat test cases NTIMES times ---
      DO 60 k = 1, ntimes

          CALL CM_timer_clear(1)
          CALL CM_timer_start(1)
          c = a
          CALL CM_timer_stop(1)
          t = CM_timer_read_elapsed(1)
          times(1,k) = t

          CALL CM_timer_clear(2)
          CALL CM_timer_start(2)
          c = 3.0D0 * a
          CALL CM_timer_stop(2)
          t = CM_timer_read_elapsed(2)
          times(2,k) = t

          CALL CM_timer_clear(3)
          CALL CM_timer_start(3)
          c = a + b
          CALL CM_timer_stop(3)
          t = CM_timer_read_elapsed(3)
          times(3,k) = t

          CALL CM_timer_clear(4)
          CALL CM_timer_start(4)
          c = a + 3.0D0 * b
          CALL CM_timer_stop(4)
          t = CM_timer_read_elapsed(4)
          times(4,k) = t

   60 CONTINUE

*       --- SUMMARY ---
      rmstime = SUM(times**2, DIM=2)
      rmstime = SQRT( rmstime/dble(ntimes) )
      mintime = MINVAL(times, DIM=2)
      maxtime = MAXVAL(times, DIM=2)
      WRITE (*,FMT=9000)
      DO 90 j = 1,4
          WRITE (*,FMT=9010) label(j),n*bytes(j)*nbpw/mintime(j)/1.0D6,
     $      rmstime(j),mintime(j),maxtime(j)
   90 CONTINUE

 9000 FORMAT (/1x, 57('-'),/,' Function  :',1x,
     $        'Rate (MB/s)  RMS time    Min time    Max time')
 9010 FORMAT (a,4(f10.4,2x))
      END


*-------------------------------------
* INTEGER FUNCTION realsize()
*
* A semi-portable way to determine the precision of DOUBLEPRECISION
* in Fortran.
* Here used to guess how many bytes of storage a DOUBLEPRECISION 
* number occupies.
*
      INTEGER FUNCTION realsize()

C     .. Local Scalars ..
      DOUBLE PRECISION result,test
      INTEGER j,ndigits
C     ..
C     .. Local Arrays ..
      DOUBLE PRECISION ref(30)
C     ..
C     .. External Subroutines ..
      EXTERNAL dummy
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC abs,acos,log10,sqrt
C     ..

C       Test #1 - compare single(1.0d0+delta) to 1.0d0

   10 DO 20 j = 1,30
          ref(j) = 1.0d0 + 10.0d0**(-j)
   20 CONTINUE

      DO 30 j = 1,30
          test = ref(j)
          ndigits = j
          CALL dummy(test,result)
          IF (test.EQ.1.0D0) THEN
              GO TO 50
          END IF
   30 CONTINUE
      GOTO 60

   50 WRITE (*,FMT='(a)') ' --------------------------------------'
      WRITE (*,FMT='(1x,a,i2,a)') 'Double precision appears to have ',
     $  ndigits,' digits of accuracy'
      IF (ndigits.LE.8) THEN
          realsize = 4
      ELSE
          realsize = 8
      END IF
      WRITE (*,FMT='(1x,a,i1,a)') 'Assuming ',realsize,
     $  ' bytes per DOUBLEPRECISION word'
      WRITE (*,FMT='(a)') ' --------------------------------------'
      RETURN

   60 PRINT *,' Hmmmm.  I am unable to determine the size of a REAL'
      PRINT *,' Please enter the number of Bytes per DOUBLEPRECISION',
     $  ' number : '
      READ (*,FMT=*) realsize
      IF (realsize.NE.4 .AND. realsize.NE.8) THEN
          PRINT *,' Your answer ',realsize,' does not make sense!'
          PRINT *,' Try again!'
          PRINT *,' Please enter the number of Bytes per ',
     $      'REAL number : '
          READ (*,FMT=*) realsize
      END IF
      PRINT *,'You have manually entered a size of ',realsize,
     $  ' bytes per REAL number'
      WRITE (*,FMT='(a)') '--------------------------------------'
      END

      SUBROUTINE dummy(q,r)
C     .. Scalar Arguments ..
      DOUBLE PRECISION q,r
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC cos
C     ..
      r = cos(q)
      RETURN
      END

From rsw@hydra.maths.unsw.EDU.AU  Thu Apr 29 21:29:45 1993
Received: from hydra.maths.unsw.EDU.AU by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA07512; Thu, 29 Apr 93 21:29:45 -0400
Received: by hydra.maths.unsw.EDU.AU (5.61/1.35)
	id AA01728; Fri, 30 Apr 93 11:34:57 +1000
Date: Fri, 30 Apr 93 11:34:57 +1000
From: rsw@hydra.maths.unsw.EDU.AU
Message-Id: <9304300134.AA01728@hydra.maths.unsw.EDU.AU>
To: mccalpin
Subject: CM5 stream_s.fcm
Status: RO

* Program: Stream
* Programmer: John D. McCalpin
* Revision: 2.0, September 30,1991
*
* CM5 Data Parallel version by Rob Womersley  19 April, 1993
*
* This program measures memory transfer rates in MB/s for simple
* computational kernels coded in Fortran.  These numbers reveal the
* quality of code generation for simple uncacheable kernels as well
* as showing the cost of floating-point operations relative to memory
* accesses.
*
* INSTRUCTIONS:
*       1) Stream requires a cpu timing function called second().
*          A sample is shown below.  This is unfortunately rather
*          system dependent.  It helps to know the granularity of the
*          timing.  The code below assumes that the granularity is
*          1/100 seconds.
*       2) Stream requires a good bit of memory to run.
*          Adjust the Parameter 'N' in the second line of the main
*          program to give a 'timing calibration' of at least 20 clicks.
*          This will provide rate estimates that should be good to
*          about 5% precision.
*       3) Compile the code with full optimization.  Many compilers
*          generate unreasonably bad code before the optimizer tightens
*          things up.  If the results are unreasonably good, on the
*          other hand, the optimizer might be too smart for me!
*       4) Mail the results to mccalpin@perelandra.cms.udel.edu
*          Be sure to include:
*               a) computer hardware model number and software revision
*               b) the compiler flags
*               c) all of the output from the test case.
*
* Thanks!
*
      PROGRAM stream
C     .. Parameters ..
      INTEGER, PARAMETER ::  n = 5000000, ntimes = 20
C     ..
C     .. Local Scalars ..
      REAL t, t0
      INTEGER j, k, nbpw, nvu
C     ..
C     .. Local Arrays ..
      REAL, ARRAY(n) ::  a, b, c
CMF$  LAYOUT a(:news), b(:news), c(:news)

      REAL, ARRAY(4) :: maxtime,mintime,rmstime
      REAL, ARRAY(4, ntimes) :: times

      INTEGER bytes(4)
      CHARACTER label(4)*12
C     ..
C     .. External Functions ..
      INTEGER CMF_number_of_processors
      DOUBLE PRECISION CM_timer_read_cm_busy, CM_timer_read_cm_idle
      DOUBLE PRECISION CM_timer_read_elapsed
      EXTERNAL CMF_number_of_processors
      EXTERNAL CM_timer_read_cm_busy, CM_timer_read_cm_idle
      EXTERNAL CM_timer_read_elapsed

      INTEGER realsize
      EXTERNAL realsize
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC float,max,min,sqrt
C     ..
C     .. Data statements ..
      DATA label/' Assignment:',' Scaling   :',' Summing   :',
     $     ' SAXPYing  :'/
      DATA bytes/2,2,3,3/
C     ..

*       --- SETUP --- determine precision and check timing ---

      PRINT *,'STREAM: Measure memory transfer rates in MB/s'
      PRINT *,'for simple computational kernels in Fortran'
      PRINT *
      PRINT *,'CALL CMF_describe_array(a)'
      CALL CMF_describe_array(a)

      nvu = CMF_number_of_processors()
      WRITE(*,'(/1x,A,I2,A,I2,A/)') 'CM5 with partition of ',nvu/4,
     $     ' processors ( ',nvu,' vector units )' 

      nbpw = realsize()

      CALL CM_timer_clear(0)
      CALL CM_timer_start(0)
      a = 1.0
      b = 2.0
      c = 0.0
      CALL CM_timer_stop(0)
      t = CM_timer_read_cm_busy(0)
      PRINT *
      PRINT *,'Vector length = ', n
      PRINT *,'Timing calibration: Time = ',t*100,' hundredths',
     $  ' of a second'
      PRINT *,'Increase the size of the arrays if this is < 30'
      PRINT *,'and your clock precision is = < 1/100 second'

*       --- MAIN LOOP --- repeat test cases NTIMES times ---
      DO 60 k = 1, ntimes

          CALL CM_timer_clear(1)
          CALL CM_timer_start(1)
          c = a
          CALL CM_timer_stop(1)
          t = CM_timer_read_elapsed(1)
          times(1,k) = t

          CALL CM_timer_clear(2)
          CALL CM_timer_start(2)
          c = 3.0 * a
          CALL CM_timer_stop(2)
          t = CM_timer_read_elapsed(2)
          times(2,k) = t

          CALL CM_timer_clear(3)
          CALL CM_timer_start(3)
          c = a + b
          CALL CM_timer_stop(3)
          t = CM_timer_read_elapsed(3)
          times(3,k) = t

          CALL CM_timer_clear(4)
          CALL CM_timer_start(4)
          c = a + 3.0 * b
          CALL CM_timer_stop(4)
          t = CM_timer_read_elapsed(4)
          times(4,k) = t

   60 CONTINUE

*       --- SUMMARY ---
      rmstime = SUM(times**2, DIM=2)
      rmstime = SQRT( rmstime/float(ntimes) )
      mintime = MINVAL(times, DIM=2)
      maxtime = MAXVAL(times, DIM=2)
      WRITE (*,FMT=9000)
      DO 90 j = 1,4
          WRITE (*,FMT=9010) label(j),n*bytes(j)*nbpw/mintime(j)/1.0e6,
     $      rmstime(j),mintime(j),maxtime(j)
   90 CONTINUE

 9000 FORMAT (/1x,57('-'),/,' Function  :',1x,
     $        'Rate (MB/s)  RMS time    Min time    Max time')
 9010 FORMAT (a,4(f10.4,2x))
      END


*-------------------------------------
* INTEGER FUNCTION realsize()
*
* A semi-portable way to determine the precision of default REAL
* in Fortran.
* Here used to guess how many bytes of storage a real number occupies.
*
      INTEGER FUNCTION realsize()

C       Test #1 - compare double precision pi to acos(-1.0e0)

C     .. Local Scalars ..
      DOUBLE PRECISION pi
      REAL diff,picalc,result,test
      INTEGER j,ndigits
C     ..
C     .. Local Arrays ..
      DOUBLE PRECISION ref(30)
C     ..
C     .. External Subroutines ..
      EXTERNAL dummy
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC abs,acos,log10,sqrt
C     ..
      pi = 3.14159265358979323846264338327950288d0
      picalc = acos(-1.0e0)
      diff = abs(picalc-pi)
      IF (diff.EQ.0.0) THEN
          PRINT *,'Test #1 Failed = picalc=piexact'
          PRINT *,'Apparently Single=Double Precision'
          PRINT *,'Proceeding to Test #2'
          PRINT *,' '
          GO TO 10
      ELSE
          ndigits = -log10(abs(diff)) + 0.5
          GO TO 50
      END IF

C       Test #2 - compare single(1.0d0+delta) to 1.0e0

   10 DO 20 j = 1,30
          ref(j) = 1.0d0 + 10.0d0** (-j)
   20 CONTINUE

      DO 30 j = 1,30
          test = ref(j)
          ndigits = j
          CALL dummy(test,result)
          IF (test.EQ.1.0e0) THEN
              GO TO 50
          END IF
   30 CONTINUE
      PRINT *,'Test #2 failed - Precision appears to exceed 30 digits'
      PRINT *,'Proceeding to Test #3'
      GO TO 40

C       Test #3 - abs(sqrt(1.0d0)-sqrt(1.0e0))

   40 diff = abs(sqrt(1.0d0)-sqrt(1.0e0))
      IF (diff.EQ.0.0) THEN
          PRINT *,'Test Failed - sqrt(1.0e0)=sqrt(1.0d0)'
          PRINT *,'Apparently Single=Double Precision'
          PRINT *,'Giving up'
          GO TO 60
      ELSE
          ndigits = -log10(abs(diff)) + 0.5
          GO TO 50
      END IF


   50 WRITE (*,FMT='(a)') '--------------------------------------'
      WRITE (*,FMT='(1x,a,i2,a)') 'Single precision appears to have ',
     $  ndigits,' digits of accuracy'
      IF (ndigits.LE.8) THEN
          realsize = 4
      ELSE
          realsize = 8
      END IF
      WRITE (*,FMT='(1x,a,i1,a)') 'Assuming ',realsize,
     $  ' bytes per default REAL word'
      WRITE (*,FMT='(a)') '--------------------------------------'
      RETURN

   60 PRINT *,'Hmmmm.  I am unable to determine the size of a REAL'
      PRINT *,'Please enter the number of Bytes per REAL number : '
      READ (*,FMT=*) realsize
      IF (realsize.NE.4 .AND. realsize.NE.8) THEN
          PRINT *,'Your answer ',realsize,' does not make sense!'
          PRINT *,'Try again!'
          PRINT *,'Please enter the number of Bytes per ',
     $      'REAL number : '
          READ (*,FMT=*) realsize
      END IF
      PRINT *,'You have manually entered a size of ',realsize,
     $  ' bytes per REAL number'
      WRITE (*,FMT='(a)') '--------------------------------------'
      END

      SUBROUTINE dummy(q,r)
C     .. Scalar Arguments ..
      REAL q,r
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC cos
C     ..
      r = cos(q)
      RETURN
      END

From rsw@hydra.maths.unsw.EDU.AU  Thu Apr 29 21:30:14 1993
Received: from hydra.maths.unsw.EDU.AU by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA07518; Thu, 29 Apr 93 21:30:14 -0400
Received: by hydra.maths.unsw.EDU.AU (5.61/1.35)
	id AA01867; Fri, 30 Apr 93 11:35:27 +1000
Date: Fri, 30 Apr 93 11:35:27 +1000
From: rsw@hydra.maths.unsw.EDU.AU
Message-Id: <9304300135.AA01867@hydra.maths.unsw.EDU.AU>
To: mccalpin
Subject: CM5 Stream_d results
Status: RO

 STREAM: Measure memory transfer rates in MB/s
 for simple computational kernels in Fortran

 SunOS Release 4.1.2 (CMGENERIC) #14:
 CMOST Version 7.2 beta1.1-P2: Thu Jan 21 14 :15:00 EST 1993

 cmf -O -o s_do -implicit_none stream_d.fcm
 cmf [CM5 VecUnit 2.1 Beta 0]


 CALL CMF_describe_array(a)

 Descriptor address    : 78b0c

  desc_or_obj_kind     :  array argument
  debug info: <nil>
  element_type         :  double float
  spare1               :  0
  spare2               :  0
  home                 :  cm
  cm_location          :  1343432712
  initial_data         :  ffffffff
  user_rank            :  1
  axes_extents         :  5000000 
  axes_layout_maps     :  1 
  is_modified?         :  no
  array_geometry       :  93c10
  spare6               :  ffffffff
  geometry_rank        :  1
  geometry_offsets     :  0 
  axes_extents_ptr     :  78e2c
  axes_maps_ptr        :  78db8
  geometry_offsets_ptr :  78db4
  debug_info_ptr       :  0
  view_or_thread_ptr   :  ffffffff
  is_slicewise         :  1
  element_size         :  8

Array geometry id: 0x93c10
  Rank: 1
  Number of elements: 5000000
  Extents: [5000000]
  Machine geometry id: 0x93bb0, rank: 1, column major
   Machine geometry elements:  5000192
   Overall subgrid size:       78128
  Axis 0:
   Extent:    5000192 (64 physical x 78128 subgrid)
   Off-chip:  6 bits, mask = 0x3f
   Subgrid:   length = 78128, axis-increment = 1


 CM5 with partition of 16 processors ( 64 vector units )

 --------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
 --------------------------------------

 Vector length =      5000000
 Timing calibration: Time =   2.826233333333334 hundredths of a second
 Increase the size of the arrays if this is < 30
 and your clock precision is =< 1/100 second

 ---------------------------------------------------------
 Function  : Rate (MB/s)  RMS time    Min time    Max time
 Assignment: 4881.2055      0.0164      0.0164      0.0164
 Scaling   : 4894.5720      0.0164      0.0163      0.0164
 Summing   : 7331.8805      0.0164      0.0164      0.0164
 SAXPYing  : 7333.7272      0.0164      0.0164      0.0164
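
For reference, the Rate column above follows from the expression in the
SUMMARY loop of stream_d.fcm, n*bytes(j)*nbpw/mintime(j)/1.0D6.  The
following is only a minimal sketch of that arithmetic, assuming the
rounded Min time values printed in the table; the program name "rate"
and the output format are illustrative choices.  Because the benchmark
itself uses unrounded timings, the sketch only approximates the printed
rates (about 4878 MB/s for Assignment rather than 4881.2055).

```fortran
      PROGRAM rate
C     Sketch: recompute the "Rate (MB/s)" column from the printed
C     values.  mintime holds the ROUNDED figures from the table, so
C     the results only approximate the printed rates.
      INTEGER n, nbpw, j
      INTEGER bytes(4)
      DOUBLE PRECISION mintime(4)
C     Words moved per element: copy, scale, sum, SAXPY
      DATA bytes/2,2,3,3/
      DATA mintime/0.0164D0,0.0163D0,0.0164D0,0.0164D0/
C     Vector length and bytes per word, as reported in the output
      n = 5000000
      nbpw = 8
      DO 10 j = 1,4
          WRITE (*,FMT='(1x,i1,2x,f10.4)') j,
     $      n*bytes(j)*nbpw/mintime(j)/1.0D6
   10 CONTINUE
      END
```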

From rsw@hydra.maths.unsw.EDU.AU  Thu Apr 29 21:30:42 1993
Received: from hydra.maths.unsw.EDU.AU by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA07522; Thu, 29 Apr 93 21:30:42 -0400
Received: by hydra.maths.unsw.EDU.AU (5.61/1.35)
	id AA01969; Fri, 30 Apr 93 11:35:53 +1000
Date: Fri, 30 Apr 93 11:35:53 +1000
From: rsw@hydra.maths.unsw.EDU.AU
Message-Id: <9304300135.AA01969@hydra.maths.unsw.EDU.AU>
To: mccalpin
Subject: CM5 Stream_s results
Status: RO

 STREAM: Measure memory transfer rates in MB/s
 for simple computational kernels in Fortran

 SunOS Release 4.1.2 (CMGENERIC) #14:
 CMOST Version 7.2 beta1.1-P2: Thu Jan 21 14 :15:00 EST 1993

 cpio% cmf -O -o s_so -implicit_none stream_s.fcm
 cmf [CM5 VecUnit 2.1 Beta 0]


 CALL CMF_describe_array(a)

 Descriptor address    : 7aaac

  desc_or_obj_kind     :  array argument
  debug info: <nil>
  element_type         :  float
  spare1               :  0
  spare2               :  0
  home                 :  cm
  cm_location          :  1342807592
  initial_data         :  ffffffff
  user_rank            :  1
  axes_extents         :  5000000 
  axes_layout_maps     :  1 
  is_modified?         :  no
  array_geometry       :  95fd0
  spare6               :  ffffffff
  geometry_rank        :  1
  geometry_offsets     :  0 
  axes_extents_ptr     :  7adcc
  axes_maps_ptr        :  7ad58
  geometry_offsets_ptr :  7ad54
  debug_info_ptr       :  0
  view_or_thread_ptr   :  ffffffff
  is_slicewise         :  1
  element_size         :  4

Array geometry id: 0x95fd0
  Rank: 1
  Number of elements: 5000000
  Extents: [5000000]
  Machine geometry id: 0x95f70, rank: 1, column major
   Machine geometry elements:  5000192
   Overall subgrid size:       78128
  Axis 0:
   Extent:    5000192 (64 physical x 78128 subgrid)
   Off-chip:  6 bits, mask = 0x3f
   Subgrid:   length = 78128, axis-increment = 1

 CM5 with partition of 16 processors ( 64 vector units )

--------------------------------------
 Single precision appears to have  7 digits of accuracy
 Assuming 4 bytes per default REAL word
--------------------------------------

 Vector length =      5000000
 Timing calibration: Time =   6.677066 hundredths of a second
 Increase the size of the arrays if this is < 30
 and your clock precision is = < 1/100 second

 ---------------------------------------------------------
 Function  : Rate (MB/s)  RMS time    Min time    Max time
 Assignment: 1420.9683      0.0282      0.0281      0.0295
 Scaling   : 1422.6085      0.0282      0.0281      0.0295
 Summing   : 2133.0415      0.0281      0.0281      0.0282
 SAXPYing  : 2133.1265      0.0282      0.0281      0.0295

From rsw@hydra.maths.unsw.EDU.AU  Mon May  3 07:43:06 1993
Received: from hydra.maths.unsw.EDU.AU by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA14801; Mon, 3 May 93 07:43:06 -0400
Received: by hydra.maths.unsw.EDU.AU (5.61/1.35)
	id AA15342; Mon, 3 May 93 21:48:33 +1000
Date: Mon, 3 May 93 21:48:33 +1000
From: rsw@hydra.maths.unsw.EDU.AU
Message-Id: <9305031148.AA15342@hydra.maths.unsw.EDU.AU>
To: mccalpin
Subject: Re:  CM5 Stream_d results
Status: RO

The man page states the CM5 timers have microsecond precision.

I tried a few other vector lengths, and the results are remarkably
consistent. The runs were made when I had the partition virtually to myself.

I have had varying results timing operations that involved communications,
but not these. Would like to see results from some other CM5 sites,
but Australia only has 32 PN machines.

Rob

 Compiled with NO optimization: cmf -implicit_none stream_d.fcm

 STREAM: Measure memory transfer rates in MB/s
 for simple computational kernels in Fortran

 CALL CMF_describe_array(a)

  desc_or_obj_kind     :  array argument
  element_type         :  double float
  home                 :  cm
  user_rank            :  1
  axes_extents         :  12800000 
  axes_layout_maps     :  1 
  element_size         :  8

Array geometry id: 0x93b28
  Rank: 1
  Number of elements: 12800000
  Extents: [12800000]
  Machine geometry id: 0x93ac8, rank: 1, column major
   Machine geometry elements:  12800000
   Overall subgrid size:       200000
  Axis 0:
   Extent:    12800000 (64 physical x 200000 subgrid)
   Off-chip:  6 bits, mask = 0x3f
   Subgrid:   length = 200000, axis-increment = 1


 CM5 with partition of 16 processors ( 64 vector units )

 --------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
 --------------------------------------

 Vector length =     12800000
 Timing calibration: Time =   7.229106060606060 hundredths of a second
 Increase the size of the arrays if this is < 30
 and your clock precision is =< 1/100 second

 ---------------------------------------------------------
 Function  : Rate (MB/s)   RMS time    Min time    Max time
 Assignment: 4885.82465     0.04206     0.04192     0.04332
 Scaling   : 4894.24551     0.04185     0.04185     0.04187
 Summing   : 5129.65827     0.05996     0.05989     0.06128
 SAXPYing  : 5129.20148     0.05997     0.05989     0.06130

===================================================================

 Compiled with Optimization: cmf -O -implicit_none stream_d.fcm

 STREAM: Measure memory transfer rates in MB/s
 for simple computational kernels in Fortran

 CALL CMF_describe_array(a)

  desc_or_obj_kind     :  array argument
  element_type         :  double float
  home                 :  cm
  user_rank            :  1
  axes_extents         :  6400000 
  axes_layout_maps     :  1 
  element_size         :  8

Array geometry id: 0x93c10
  Rank: 1
  Number of elements: 6400000
  Extents: [6400000]
  Machine geometry id: 0x93bb0, rank: 1, column major
   Machine geometry elements:  6400000
   Overall subgrid size:       100000
  Axis 0:
   Extent:    6400000 (64 physical x 100000 subgrid)
   Off-chip:  6 bits, mask = 0x3f
   Subgrid:   length = 100000, axis-increment = 1


 CM5 with partition of 16 processors ( 64 vector units )

 --------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
 --------------------------------------

 Vector length =      6400000
 Timing calibration: Time =   3.616887878787879 hundredths of a second
 Increase the size of the arrays if this is < 30
 and your clock precision is =< 1/100 second

 ---------------------------------------------------------
 Function  : Rate (MB/s)   RMS time    Min time    Max time
 Assignment: 4882.29863     0.02098     0.02097     0.02101
 Scaling   : 4894.68150     0.02092     0.02092     0.02093
 Summing   : 7333.00445     0.02095     0.02095     0.02096
 SAXPYing  : 7335.86990     0.02094     0.02094     0.02099

From rsw@hydra.maths.unsw.EDU.AU  Wed May  5 18:44:42 1993
Received: from hydra.maths.unsw.EDU.AU by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA23379; Wed, 5 May 93 18:44:42 -0400
Received: by hydra.maths.unsw.EDU.AU (5.61/1.35)
	id AA06281; Thu, 6 May 93 08:50:19 +1000
Date: Thu, 6 May 93 08:50:19 +1000
From: rsw@hydra.maths.unsw.EDU.AU
Message-Id: <9305052250.AA06281@hydra.maths.unsw.EDU.AU>
To: mccalpin
Subject: Re:  CM5 Stream_d results
Status: RO

John

The optimized results for the stream benchmark on a CM5 do look funny.
I changed the constant 3.0D0 in the SAXPY operation to DBLE(k),
where k is the loop index, and this inhibited whatever optimization
was going on, to produce the results below. I also noticed that the
unoptimized code could handle much larger problems (200,000 elements per VU),
while the optimized code ran out of memory well before then (an extra
temporary).  I will get and forward an assembly listing of just the SAXPY
loop (if you are interested).  Have you got any other info on how stream
runs on a CM5?
For large enough vectors I would expect it to scale well to larger machines.

Rob

=============================================================================

 NO Optimization: cmf stream_d.fcm

 SAXPY with constant from loop index
          s = DBLE(k)
          CALL CM_timer_clear(4)
          CALL CM_timer_start(4)
          c = a + s * b
          CALL CM_timer_stop(4)
          t = CM_timer_read_elapsed(4)
          times(4,k) = t


 STREAM: Measure memory transfer rates in MB/s
 for simple computational kernels in Fortran


 CM5 with partition of 16 processors ( 64 vector units )

 --------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
 --------------------------------------

 Array length =      9600000
 Elements per VU =       150000
 Timing calibration: Time =   5.455848484848485 hundredths of a second
 Increase the size of the arrays if this is < 30
 and your clock precision is =< 1/100 second

 ---------------------------------------------------------
 Function  : Rate (MB/s)   RMS time    Min time    Max time
 Assignment: 4887.33331     0.03143     0.03143     0.03145
 Scaling   : 4893.41038     0.03153     0.03139     0.03276
 Summing   : 5145.05312     0.04500     0.04478     0.04615
 SAXPYing  : 5143.42075     0.04501     0.04480     0.04616

 =======================================================================

 Optimization ON: cmf -O stream_d.fcm

 STREAM: Measure memory transfer rates in MB/s
 for simple computational kernels in Fortran


 CM5 with partition of 16 processors ( 64 vector units )

 --------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
 --------------------------------------

 Array length =      9600000
 Elements per VU =       150000
 Timing calibration: Time =   5.468930303030303 hundredths of a second
 Increase the size of the arrays if this is < 30
 and your clock precision is =< 1/100 second

 ---------------------------------------------------------
 Function  : Rate (MB/s)   RMS time    Min time    Max time
 Assignment: 4894.00569     0.03139     0.03139     0.03142
 Scaling   : 4898.13904     0.03144     0.03136     0.03273
 Summing   : 7328.59745     0.03158     0.03144     0.03280
 SAXPYing  : 5143.05891     0.04494     0.04480     0.04615
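
One way to read the two tables above: if the time per kernel were
proportional to the bytes moved, Summing (3 words per element) would
take 1.5 times as long as Assignment (2 words per element).  The
following is only a sketch of that comparison, using the Min times
printed above; the program and variable names are illustrative.  The
ratio comes out at roughly 1.42 without -O and about 1.00 with -O,
which is the "funny" behaviour described in the message.

```fortran
      PROGRAM ratio
C     Sketch: compare the Summing/Assignment time ratio with the
C     1.5 expected if time were proportional to bytes moved.
C     Min times are copied from the two tables above.
      DOUBLE PRECISION tcopy0, tsum0, tcopy1, tsum1
C     No optimization (cmf)
      tcopy0 = 0.03143D0
      tsum0 = 0.04478D0
C     Optimization on (cmf -O)
      tcopy1 = 0.03139D0
      tsum1 = 0.03144D0
      WRITE (*,FMT='(1x,a,f7.4)') 'Expected ratio    : ',3.0D0/2.0D0
      WRITE (*,FMT='(1x,a,f7.4)') 'Ratio without -O  : ',tsum0/tcopy0
      WRITE (*,FMT='(1x,a,f7.4)') 'Ratio with -O     : ',tsum1/tcopy1
      END
```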

From derek_robb@corwin.cray.com  Thu May  6 14:53:49 1993
Received: from bach.udel.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA26030; Thu, 6 May 93 14:53:49 -0400
Received: from corwin.cray.com by bach.udel.edu with SMTP
	(5.65c/IDA-1.2.8) id AA12931; Thu, 6 May 1993 14:57:58 -0400
Message-Id: <199305061857.AA12931@bach.udel.edu>
Date: 6 May 93 12:56:23 U
From: "Derek Robb" <derek_robb@corwin.cray.com>
Subject: STREAM Benchmark
To: mccalpin@bach.udel.edu
Status: RO

                       Subject:                               Time:12:57 PM
  OFFICE MEMO          STREAM Benchmark                       Date:5/6/93
I received a fax copy of your STREAM benchmark results for about 100 systems
from Robert Bell at CSIRO. I would like to have these in digital form. Would
you be kind enough to e-mail these to me?

Thanks,

Derek Robb


From morse@mprgate.mpr.ca  Fri May  7 11:31:13 1993
Received: from [134.87.131.13] by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA29121; Fri, 7 May 93 11:31:13 -0400
Received: from quark.mpr.ca by mprgate.mpr.ca with SMTP id AA12379
  (5.65c/IDA-1.4.4 for <mccalpin@perelandra.cms.udel.edu>); Fri, 7 May 1993 08:35:18 -0700
Received: by quark.mpr.ca (5.57/Ultrix3.0-C)
	id AA17608; Fri, 7 May 93 08:35:15 -0700
Date: Fri, 7 May 93 08:35:15 -0700
From: morse@mprgate.mpr.ca (Daryl Morse)
Message-Id: <9305071535.AA17608@quark.mpr.ca>
To: mccalpin (John D. McCalpin)
In-Reply-To: mccalpin@perelandra.cms.udel.edu's message of Thu, 6 May 1993 16:53:08 GM
Subject: qgbox on HP9000/755 ?
Status: RO


   Newsgroups: comp.benchmarks
   From: mccalpin@perelandra.cms.udel.edu (John D. McCalpin)
   Nntp-Posting-Host: perelandra.cms.udel.edu
   Organization: College of Marine Studies, U. Del.
   Distribution: usa
   Date: Thu, 6 May 1993 16:53:08 GMT

>   I am trying to update the results table for a benchmark code of
>   mine (qgbox) and do not have access to one of the newer HP9000/7xx
>   machines, like the 735/755 models.

>   It hardly seems fair for me to be quoting IBM RS/6000-580 and
>   DEC 3000/500 results against the older HP 9000/730's.

Actually, if you really want to make it fair, you should try to get
numbers for DEC's 3000/500X, which is significantly faster than the
3000/500. Listing the undiscounted price of each machine the tests
were run on, or at least indicating which machines are expensive
high-end server class machines and which are desktops, would make
your numbers all the more interesting. Comparing servers against
servers and workstations against workstations is the fairest way to
go.

I point this out, because my suspicions tell me that the IBM RS6000 up
near the top of the heap is a very pricey server. (If I am wrong about
that, please disregard my suspicion.)

Please don't take my comments personally. The numbers are definitely
of interest.

Thanks.

Daryl Morse                     | Voice  : (604) 293-5476
MPR Teltech Ltd. 		| Fax    : (604) 293-5787
8999 Nelson Way, Burnaby, BC    | E-Mail : morse@mpr.ca
Canada, V5A 4B5                 |        : mprgate.mpr.ca!morse@uunet.uu.net

From desj@ccr-p.ida.org  Sun May  9 16:20:30 1993
Received: from idacrd.ccr-p.ida.org by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA05590; Sun, 9 May 93 16:20:30 -0400
Received: from runner.ida.org (runner.ccr-p.ida.org) by ccr-p.ida.org (4.1/SMI-4.1)
	id AA26226; Sun, 9 May 93 16:24:50 EDT
Date: Sun, 9 May 93 16:24:50 EDT
From: desj@ccr-p.ida.org (David desJardins)
Message-Id: <9305092024.AA26226@ccr-p.ida.org>
Received: by runner.ida.org (4.1/SMI-4.1)
	id AA25807; Sun, 9 May 93 16:24:49 EDT
To: mccalpin
Subject: Re: SPECint92, and other benchmarks (really LINPACK)
Newsgroups: comp.sys.sgi.hardware,comp.benchmarks
In-Reply-To: <C6748w.H06@news.udel.edu>
References: <1993Apr13.225551.26609@texhrc.uucp> <gklaass.734819791@yorku.ca> <1rlmriINNq9q@usenet.pa.dec.com>
Organization: IDA Center for Communications Research, Princeton
Cc: 
Status: RO

In article <C6748w.H06@news.udel.edu> you write:
>The current winner for aggregate bandwidth is the Cray C90, which can
>sustain 105 GB/s from its shared main memory to/from its 16 cpus.
>The Thinking Machines CM-5 is a potential competitor here, but I have
>not been able to get results off of one yet....

A 1024 node CM-5 could sustain 90% or more of its peak bandwidth, which
is 1024 * 4 * 16M * 8 = 512 GB/s.  (It can run at over 99% of that, if
you are doing something especially trivial like adding up the entries in
a very large array.)  Of course this is just between the vector units
and their local memories.
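
The peak figure quoted above can be checked with a few lines of
arithmetic.  This is only a sketch: the meaning attached to each factor
(nodes, vector units per node, accesses per second, bytes per access)
is an assumption read off the expression, and "M" and "GB" are taken as
binary units (2^20 and 2^30) so that the product comes out to the
quoted 512 GB/s.

```python
# Assumed reading of the expression 1024 * 4 * 16M * 8.
NODES = 1024
VECTOR_UNITS_PER_NODE = 4            # assumption: the factor 4
ACCESSES_PER_SECOND = 16 * 2**20     # assumption: "16M" as binary mega
BYTES_PER_ACCESS = 8


def peak_bandwidth_gb_per_s():
    """Aggregate peak bandwidth in GB/s, with 1 GB = 2**30 bytes."""
    bytes_per_second = (NODES * VECTOR_UNITS_PER_NODE
                        * ACCESSES_PER_SECOND * BYTES_PER_ACCESS)
    return bytes_per_second / 2**30


peak = peak_bandwidth_gb_per_s()
sustained = 0.90 * peak   # the "90% or more" figure from the text
print(peak, sustained)
```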

                                        David desJardins

From tuna@lhotse.LCS.MIT.EDU  Tue May 11 18:55:49 1993
Received: from LHOTSE.LCS.MIT.EDU by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA12494; Tue, 11 May 93 18:55:49 -0400
Received: by lhotse.LCS.MIT.EDU 
	id AA28381; Tue, 11 May 93 19:00:17 -0400
Date: Tue, 11 May 93 19:00:17 -0400
From: tuna@lhotse.LCS.MIT.EDU (Kirk 'UhOh' Johnson)
Message-Id: <9305112300.AA28381@lhotse.LCS.MIT.EDU>
To: mccalpin
Subject: stream results for SS10/30
Reply-To: Kirk Johnson <tuna@HING.LCS.MIT.EDU>
Status: RO


i noticed your stream results don't include any measurements for
non-supercache SS10 systems, so i went ahead and compiled things up on
one of our SS10/30s and ran some measurements.

the specific details are:

- SS10/30 (36 MHz, 20/16 kbyte on-chip I/D caches)
- SunOS 4.1.3
- compiled with "f77 -O4 -Bstatic" (using whatever version of f77 we
  got from sun at the same time they shipped us sun C 1.0; it's almost
  certainly not the latest and greatest FORTRAN compiler from sun)


results from five consecutive runs of "stream_s":
--------------------------------------
 Single precision appears to have  7 digits of accuracy
 Assuming 4 bytes per default REAL word
--------------------------------------
Timing calibration ; time =     255.000 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   33.8984      0.5910      0.5900      0.6000
Scaling   :   41.6667      0.4820      0.4800      0.4900
Summing   :   38.9610      0.7790      0.7700      0.7800
SAXPYing  :   33.7079      0.8910      0.8900      0.9000
--------------------------------------
 Single precision appears to have  7 digits of accuracy
 Assuming 4 bytes per default REAL word
--------------------------------------
Timing calibration ; time =     252.000 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   33.8983      0.5940      0.5900      0.6000
Scaling   :   41.6667      0.4840      0.4800      0.4900
Summing   :   38.9610      0.7780      0.7700      0.7800
SAXPYing  :   34.0909      0.8880      0.8800      0.8900
--------------------------------------
 Single precision appears to have  7 digits of accuracy
 Assuming 4 bytes per default REAL word
--------------------------------------
Timing calibration ; time =     251.000 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   33.8983      0.5930      0.5900      0.6100
Scaling   :   41.6667      0.4820      0.4800      0.4900
Summing   :   38.9610      0.7780      0.7700      0.7800
SAXPYing  :   33.7079      0.8900      0.8900      0.8900
--------------------------------------
 Single precision appears to have  7 digits of accuracy
 Assuming 4 bytes per default REAL word
--------------------------------------
Timing calibration ; time =     251.000 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   33.8983      0.5930      0.5900      0.6100
Scaling   :   41.6667      0.4850      0.4800      0.4900
Summing   :   38.9610      0.7810      0.7700      0.7900
SAXPYing  :   34.0909      0.8870      0.8800      0.8900
--------------------------------------
 Single precision appears to have  7 digits of accuracy
 Assuming 4 bytes per default REAL word
--------------------------------------
Timing calibration ; time =     252.000 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   33.8983      0.5930      0.5900      0.6100
Scaling   :   41.6667      0.4820      0.4800      0.4900
Summing   :   38.9611      0.7810      0.7700      0.7900
SAXPYing  :   34.0909      0.8900      0.8800      0.9000


results from five consecutive runs of "stream_d":
--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
Timing calibration ; time =     303.99999339134 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   42.1053      0.5790      0.5700      0.5900
Scaling   :   46.1538      0.5280      0.5200      0.5400
Summing   :   45.5697      0.7910      0.7900      0.8000
SAXPYing  :   46.1539      0.7880      0.7800      0.7900
--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
Timing calibration ; time =     302.99999434501 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   42.1053      0.5790      0.5700      0.5800
Scaling   :   46.1539      0.5310      0.5200      0.5500
Summing   :   46.1539      0.7870      0.7800      0.7900
SAXPYing  :   46.1539      0.7880      0.7800      0.7900
--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
Timing calibration ; time =     301.99999436736 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   42.1053      0.5800      0.5700      0.5900
Scaling   :   46.1539      0.5281      0.5200      0.5500
Summing   :   46.1538      0.7890      0.7800      0.8000
SAXPYing  :   46.1539      0.7870      0.7800      0.7900
--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
Timing calibration ; time =     301.99999623001 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   42.1053      0.5790      0.5700      0.5800
Scaling   :   46.1540      0.5300      0.5200      0.5400
Summing   :   46.1539      0.7870      0.7800      0.7900
SAXPYing  :   46.1538      0.7870      0.7800      0.7900
--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
Timing calibration ; time =     302.99999527633 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   42.1053      0.5780      0.5700      0.5800
Scaling   :   46.1540      0.5291      0.5200      0.5500
Summing   :   46.1539      0.7880      0.7800      0.7900
SAXPYing  :   46.1539      0.7880      0.7800      0.7900


share and enjoy,

kirk

From tuna@kanchenjunga.LCS.MIT.EDU  Wed May 12 13:17:46 1993
Received: from KANCHENJUNGA.LCS.MIT.EDU by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA15631; Wed, 12 May 93 13:17:46 -0400
Received: by kanchenjunga.LCS.MIT.EDU 
	id AA06002; Wed, 12 May 93 13:22:16 -0400
Date: Wed, 12 May 93 13:22:16 -0400
From: tuna@kanchenjunga.LCS.MIT.EDU (Kirk 'UhOh' Johnson)
Message-Id: <9305121722.AA06002@kanchenjunga.LCS.MIT.EDU>
To: mccalpin
Subject:  stream results for SS10/30
Reply-To: Kirk Johnson <tuna@HING.LCS.MIT.EDU>
Status: RO


    thanks for the results. I will put them in the table today....

sure. we've also got two of the 50 MHz + 1 MB external cache processor
upgrade modules on order; when they show up here (late june?), i'll
run the same measurements on them as well ...

kirk

From cmg@ferrari.cray.com  Thu May 27 18:59:37 1993
Received: from timbuk.cray.com by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA26093; Thu, 27 May 93 18:59:37 -0400
Received: from magnet (magnet.cray.com) by cray.com (4.1/CRI-MX 2.19)
	id AA18038; Thu, 27 May 93 18:00:51 CDT
Received: by magnet (4.1/CRI-5.13)
	id AA04064; Thu, 27 May 93 18:00:46 CDT
From: cmg@ferrari.cray.com (Charles Grassl)
Message-Id: <9305272300.AA04064@magnet>
Subject: stream
To: mccalpin
Date: Thu, 27 May 93 18:00:42 CDT
X-Mailer: ELM [version 2.3 PL11]
Status: RO

Hello John;

Below are performance results for running the STREAM benchmark on an 8
CPU CRAY Y-MP EL98.  If you are maintaining and distributing a list with
STREAM results, could you please add the EL results to your list.

The CRAY Y-MP EL98 has up to 8 CPUs which run at 33.3 MHz.  Each CPU
has four floating point functional units.  Each functional unit can
produce one floating point add or (exclusive) one floating point
multiply per clock period.

Pairs of EL CPUs share four memory ports.  As you can see from the
data, the memory ports can sustain up to four CPUs.  For eight CPUs,
the Assignment, Scaling and Summing loops run much faster than for 4
CPUs because the additional CPUs use ports C and D.  For SAXPYing, the
additional CPUs only have use of port D.  (Actually, the pairs of CPUs
share all four ports.)

This benchmark is an interesting experiment for this memory
architecture!

John, I'm trying to figure out a way to characterize computer power by
a "weighted" memory bandwidth.  More memory "closer", or faster, would
count more than memory "farther", or slower.  My idea is to calculate
the second moment of memory defined as:

                      __
Memory Moment (MM) =  \   memory words * (rate)^2
                      /
                      --

The MM would have units of [ Mbyte * (Mbyte/s)^2 ]

For the rate, I could use the transfer rate measured by STREAM.  The
measurement of MM would entail running STREAM for all sizes of memory
and "integrating" or summing the contributions.
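
One way the summation might be carried out is sketched below.  It
assumes STREAM has been run at a series of increasing array sizes, and
that each measured rate applies to the memory added since the previous
size; that reading of "integrating" is an assumption, as are the
function name and the sample numbers, which are placeholders rather
than measurements.

```python
def memory_moment(measurements):
    """Sum (added memory) * rate**2 over measurements sorted by size.

    measurements: iterable of (size_mbyte, rate_mbyte_per_s) pairs.
    Result has units of Mbyte * (Mbyte/s)**2.
    """
    total = 0.0
    previous_size = 0.0
    for size, rate in sorted(measurements):
        total += (size - previous_size) * rate ** 2
        previous_size = size
    return total


# Placeholder inputs only: 1 Mbyte at 400 MB/s, then up to 16 Mbyte at 100 MB/s.
example = [(1.0, 400.0), (16.0, 100.0)]
print(memory_moment(example))
```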

Does this look reasonable to you?  Would MM have any predictive powers
for computer performance?

One other question:  What would be analogous to torque for the above
"moment of inertia"?

Regards,
-- 
Charles Grassl
Cray Research, Inc.
cmg@cray.com
(612) 683-3531 


====================================================
                     1 CPUs
====================================================
--------------------------------------
 Single precision appears to have 14 digits of accuracy
 Assuming 8 bytes per default REAL word
--------------------------------------
 Timing calibration ; time = 31.968054 hundredths of a second
 Increase the size of the arrays if this is <30  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:  437.2382      0.1098      0.1098      0.1098  
Scaling   :  436.6548      0.1106      0.1099      0.1113  
Summing   :  536.1533      0.1346      0.1343      0.1349  
SAXPYing  :  476.7855      0.1510      0.1510      0.1510  
 STOP  (called by STREAM )
 CP: 1.345s,  Wallclock: 1.349s,  12.5% of 8-CPU Machine
 HWM mem: 9129405, HWM stack: 9004265, Stack overflows: 0
 

====================================================
                     2 CPUs
====================================================
--------------------------------------
 Single precision appears to have 14 digits of accuracy
 Assuming 8 bytes per default REAL word
--------------------------------------
 Timing calibration ; time = 15.945318 hundredths of a second
 Increase the size of the arrays if this is <30  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:  826.6837      0.0581      0.0581      0.0582  
Scaling   :  833.7715      0.0576      0.0576      0.0576  
Summing   : 1048.9873      0.0686      0.0686      0.0687  
SAXPYing  : 1078.1763      0.0668      0.0668      0.0669  
 STOP  (called by STREAM )
 CP: 1.354s,  Wallclock: 1.076s,  15.7% of 8-CPU Machine
 HWM mem: 9129405, HWM stack: 9004265, Stack overflows: 0
 

====================================================
                     4 CPUs
====================================================
--------------------------------------
 Single precision appears to have 14 digits of accuracy
 Assuming 8 bytes per default REAL word
--------------------------------------
 Timing calibration ; time = 8.152287 hundredths of a second
 Increase the size of the arrays if this is <30  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment: 1564.9253      0.0307      0.0307      0.0307  
Scaling   : 1569.8418      0.0306      0.0306      0.0307  
Summing   : 1933.8354      0.0373      0.0372      0.0374  
SAXPYing  : 1955.4611      0.0369      0.0368      0.0369  
 STOP  (called by STREAM )
 CP: 1.440s,  Wallclock: 0.772s,  23.3% of 8-CPU Machine
 HWM mem: 9139645, HWM stack: 9004265, Stack overflows: 0
 

====================================================
                     8 CPUs
====================================================
--------------------------------------
 Single precision appears to have 14 digits of accuracy
 Assuming 8 bytes per default REAL word
--------------------------------------
 Timing calibration ; time = 5.824599 hundredths of a second
 Increase the size of the arrays if this is <30  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment: 2362.8129      0.0244      0.0203      0.0279  
Scaling   : 2310.5194      0.0249      0.0208      0.0284  
Summing   : 2373.6665      0.0312      0.0303      0.0321  
SAXPYing  : 2363.7914      0.0305      0.0305      0.0305  
 STOP  (called by STREAM )
 CP: 2.011s,  Wallclock: 1.160s,  21.7% of 8-CPU Machine
 HWM mem: 9149885, HWM stack: 9004265, Stack overflows: 0
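
The scaling across CPU counts can be read off the tables above by
dividing each rate by the corresponding 1 CPU rate.  The short sketch
below does that; the rates are copied from the four tables, while the
dictionary layout and function name are only illustrative.

```python
# Rates in MB/s copied from the 1, 2, 4 and 8 CPU tables above.
RATES = {
    "Assignment": {1: 437.2382, 2: 826.6837, 4: 1564.9253, 8: 2362.8129},
    "Scaling":    {1: 436.6548, 2: 833.7715, 4: 1569.8418, 8: 2310.5194},
    "Summing":    {1: 536.1533, 2: 1048.9873, 4: 1933.8354, 8: 2373.6665},
    "SAXPYing":   {1: 476.7855, 2: 1078.1763, 4: 1955.4611, 8: 2363.7914},
}


def speedup(kernel, cpus):
    """Rate on `cpus` CPUs relative to the 1 CPU rate for the same kernel."""
    return RATES[kernel][cpus] / RATES[kernel][1]


for kernel in RATES:
    for cpus in (2, 4, 8):
        s = speedup(kernel, cpus)
        print(f"{kernel:10s} {cpus} CPUs: speedup {s:5.2f}, "
              f"efficiency {s / cpus:4.2f}")
```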

From csrcb@mel.dit.csiro.au  Tue Jun 22 22:04:39 1993
Received: from bach.udel.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA16894; Tue, 22 Jun 93 22:04:39 -0400
Received: from shark.mel.dit.CSIRO.AU by bach.udel.edu with SMTP
	(5.65c/IDA-1.2.8) id AA19731; Tue, 22 Jun 1993 22:07:13 -0400
Received: by shark.mel.dit.csiro.au id AA17947
  (5.65c/IDA-1.4.4/DIT-1.3 for mccalpin@bach.udel.edu); Wed, 23 Jun 1993 12:07:27 +1000
From: Robert Bell <Robert.Bell@mel.dit.csiro.au>
Message-Id: <199306230207.AA17947@shark.mel.dit.csiro.au>
Subject: Stream Benchmark
To: mccalpin@bach.udel.edu
Date: Wed, 23 Jun 93 12:07:26 EST
X-Mailer: ELM [version 2.3 PL11]
Status: RO

John,
     I have been following your stream benchmark with some interest.
A few years ago, I developed an interest in measuring the performance
of computer memory systems, and have some codes which illustrate and
measure characteristics and performance.  I worked with Charles Grassl
from Cray on these codes some time ago.
     Anyway, I have just downloaded a copy of the latest summary of
the results, and have a query about the Y-MP EL 1 cpu results for the
Triad benchmark.  Is the stated figure of 476.8 correct?
If so, there is a curiosity in that the 2 cpu result is more than
twice as fast, which is hard to explain.  Is there a misprint, with
the true figure being 576.8 ?  This would be more consistent with the
                     ^
other Y-MP EL and Y-MP results.
Thanks
Rob. Bell		     (	email: csrcb@mel.dit.csiro.au  )
--
/ Robert.Bell@mel.dit.csiro.au |  CSIRO Supercomputing Facility Manager  \
| CSIRO Division of Information Technology, Supercomputing Support Group |
| 723 Swanston Street          |  tel: +61 3 282 2620 or +61 018 108 333 |
\ Carlton VIC 3053 Australia   |  fax: +61 3 282 2600                    /

From tuna@spica.LCS.MIT.EDU  Wed Jun 30 10:33:08 1993
Received: from SPICA.LCS.MIT.EDU by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA01755; Wed, 30 Jun 93 10:33:08 -0400
Received: by spica.LCS.MIT.EDU 
	id AA00480; Wed, 30 Jun 93 10:35:47 -0400
Date: Wed, 30 Jun 93 10:35:47 -0400
From: tuna@spica.LCS.MIT.EDU (Kirk 'UhOh' Johnson)
Message-Id: <9306301435.AA00480@spica.LCS.MIT.EDU>
To: mccalpin
Subject: stream results for SS10/51
Reply-To: Kirk Johnson <tuna@HING.LCS.MIT.EDU>
Status: RO


here are results for your stream benchmarks running on an SS10/51 (50
MHz SuperSparc, 1 MB external cache), compiled with "f77 -O" using
whatever version of fortran sun shipped along with their sun C 1.0
product.


--------------------------------------
 Single precision appears to have  7 digits of accuracy
 Assuming 4 bytes per default REAL word
--------------------------------------
Timing calibration ; time =     55.0000 hundredths of a second
Increase the size of the arrays if this is <30 
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   44.4444      0.1011      0.0900      0.1100
Scaling   :   40.0000      0.1031      0.1000      0.1100
Summing   :   42.8572      0.1471      0.1400      0.1500
SAXPYing  :   42.8572      0.1461      0.1400      0.1500
--------------------------------------
 Single precision appears to have  7 digits of accuracy
 Assuming 4 bytes per default REAL word
--------------------------------------
Timing calibration ; time =     55.0000 hundredths of a second
Increase the size of the arrays if this is <30 
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   40.0000      0.1041      0.1000      0.1100
Scaling   :   40.0000      0.1041      0.1000      0.1100
Summing   :   42.8571      0.1481      0.1400      0.1500
SAXPYing  :   42.8572      0.1431      0.1400      0.1500
--------------------------------------
 Single precision appears to have  7 digits of accuracy
 Assuming 4 bytes per default REAL word
--------------------------------------
Timing calibration ; time =     56.0000 hundredths of a second
Increase the size of the arrays if this is <30 
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   44.4446      0.1032      0.0900      0.1100
Scaling   :   40.0000      0.1031      0.1000      0.1100
Summing   :   42.8572      0.1471      0.1400      0.1500
SAXPYing  :   42.8572      0.1461      0.1400      0.1500


--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
Timing calibration ; time =     60.999995470047 hundredths of a second
Increase the size of the arrays if this is <30 
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   43.6365      0.1221      0.1100      0.1300
Scaling   :   43.6363      0.1171      0.1100      0.1200
Summing   :   42.3530      0.1780      0.1700      0.1800
SAXPYing  :   42.3530      0.1720      0.1700      0.1800
--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
Timing calibration ; time =     59.999997913837 hundredths of a second
Increase the size of the arrays if this is <30 
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   40.0000      0.1241      0.1200      0.1300
Scaling   :   43.6363      0.1190      0.1100      0.1200
Summing   :   42.3530      0.1771      0.1700      0.1800
SAXPYing  :   42.3529      0.1780      0.1700      0.1800
--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
Timing calibration ; time =     61.000002920628 hundredths of a second
Increase the size of the arrays if this is <30 
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   40.0000      0.1251      0.1200      0.1300
Scaling   :   43.6365      0.1161      0.1100      0.1200
Summing   :   42.3530      0.1771      0.1700      0.1800
SAXPYing  :   42.3529      0.1731      0.1700      0.1800


share-n-enjoy,

kirk

From news.udel.edu!darwin.sura.net!howland.reston.ans.net!ux1.cso.uiuc.edu!sdd.hp.com!decwrl!pa.dec.com!uvo.dec.com!helles.unt.dec.com!ryn.mro4.dec.com!msbcs.enet.dec.com!bhandarkar Thu Jul 15 16:59:45 EDT 1993
Article: 41327 of comp.arch
Newsgroups: comp.arch
Path: news.udel.edu!darwin.sura.net!howland.reston.ans.net!ux1.cso.uiuc.edu!sdd.hp.com!decwrl!pa.dec.com!uvo.dec.com!helles.unt.dec.com!ryn.mro4.dec.com!msbcs.enet.dec.com!bhandarkar
From: bhandarkar@msbcs.enet.dec.com (Dileep Bhandarkar)
Subject: Re: Looking for info. on DEC 3000 AXP 400
Message-ID: <CA81Hu.IsD@ryn.mro4.dec.com>
Sender: news@ryn.mro4.dec.com (USENET News System)
Organization: Digital Equipment Corporation
References:   <1993Jul13.174722.10241@eecs.nwu.edu>
Date: Thu, 15 Jul 1993 20:39:23 GMT
Lines: 13
Status: RO


In article <1993Jul13.174722.10241@eecs.nwu.edu>, shil@nasser.eecs.nwu.edu (Lei Shi) writes...
>Hello:
> 
>	I am looking for the following information about the DEC 3000 AXP model 400.  
>	My questions are:
>1. What are the read miss penalty and write miss penalty in terms of the CPU
>cycles for the first level cache if the data is in the second level cache?  
>2. What are the read miss penalty and write miss penalty in terms of the CPU
>cycles for the second level cache if the data is in the main memory?
> 
The second level cache access time is 5 cycles. Memory access time is 27 cycles.
Clock rate is 133 MHz.
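
For readers who prefer times to cycle counts, the conversion at the
stated clock rate is simple arithmetic.  A minimal sketch, with a
hypothetical helper name:

```python
CLOCK_MHZ = 133.0   # clock rate stated above


def cycles_to_ns(cycles, clock_mhz=CLOCK_MHZ):
    """Convert a cycle count to nanoseconds at the given clock rate."""
    return cycles * 1000.0 / clock_mhz


print(cycles_to_ns(5))    # second level cache access
print(cycles_to_ns(27))   # main memory access
```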


From news.udel.edu!udel!wupost!usc!elroy.jpl.nasa.gov!decwrl!koriel!sh.wide!wnoc-tyo-news!cs.titech!is.titech!maeno Fri Jul 16 09:44:00 EDT 1993
Article: 41343 of comp.arch
Newsgroups: comp.arch
Path: news.udel.edu!udel!wupost!usc!elroy.jpl.nasa.gov!decwrl!koriel!sh.wide!wnoc-tyo-news!cs.titech!is.titech!maeno
From: maeno@is.titech.ac.jp (Toshinori Maeno)
Subject: Re: Looking for info. on DEC 3000 AXP 400
References: <1993Jul13.174722.10241@eecs.nwu.edu> 
    <CA81Hu.IsD@ryn.mro4.dec.com>
Message-ID: <1993Jul16.104143.27476@is.titech.ac.jp>
Date: Fri, 16 Jul 1993 10:41:43 GMT
Organization: Dept. of Information Science, Tokyo Institute of Technology, 
    Tokyo, JAPAN
X-Bytes: 1060
Lines: 27
Status: RO

In article <CA81Hu.IsD@ryn.mro4.dec.com> bhandarkar@msbcs.enet.dec.com (Dileep Bhandarkar) writes:
>
>In article <1993Jul13.174722.10241@eecs.nwu.edu>, shil@nasser.eecs.nwu.edu 
 (Lei Shi) writes...

>> I am looking for the following information about the DEC 3000 AXP model 400.
>>	My questions are:
>>1. What are the read miss penalty and write miss penalty in terms of the CPU
>>cycles for the first level cache if the data is in the second level cache?  
>>2. What are the read miss penalty and write miss penalty in terms of the CPU
>>cycles for the second level cache if the data is in the main memory?
>> 
>The second level cache access time is 5 cycles. Memory access time is 27 cycles.
>Clock rate is 133 MHz.

My measurement for TITAN2-400 (Alpha 133MHz) tells,
  1. read miss penalty is 8 cycles for read, 12 cycles for write for the
first level cache when the data is in the second level cache.

  2. read miss penalty is 12 cycles for read, 40 cycles for write when
the data is only in the memory.

Toshinori Maeno
Tokyo Institute of Technology


From news.udel.edu!udel!wupost!usc!elroy.jpl.nasa.gov!decwrl!koriel!sh.wide!wnoc-tyo-news!cs.titech!is.titech!maeno Fri Jul 16 09:44:12 EDT 1993
Article: 41344 of comp.arch
Newsgroups: comp.arch
Path: news.udel.edu!udel!wupost!usc!elroy.jpl.nasa.gov!decwrl!koriel!sh.wide!wnoc-tyo-news!cs.titech!is.titech!maeno
From: maeno@is.titech.ac.jp (Toshinori Maeno)
Subject: Re: Looking for info. on DEC 3000 AXP 400
References: <1993Jul13.174722.10241@eecs.nwu.edu> 
    <CA81Hu.IsD@ryn.mro4.dec.com> <1993Jul16.104143.27476@is.titech.ac.jp>
Message-ID: <1993Jul16.104701.27493@is.titech.ac.jp>
Date: Fri, 16 Jul 1993 10:47:01 GMT
Organization: Dept. of Information Science, Tokyo Institute of Technology, 
    Tokyo, JAPAN
X-Bytes: 597
Lines: 17
Status: RO

Sorry for my mistake in my last posting.

In article <1993Jul16.104143.27476@is.titech.ac.jp> maeno@is.titech.ac.jp (Toshinori Maeno) writes:

>My measurement for TITAN2-400 (Alpha 133MHz) tells,
>  1. read miss penalty is 8 cycles for read, 12 cycles for write for the
                                              ==
                                              34 is correct

>first level cache when the data is in the second level cache.
>
>  2. read miss penalty is 12 cycles for read, 40 cycles for write when
>the data is only in the memory.

Toshinori Maeno
Tokyo Institute of Technology


From news.udel.edu!darwin.sura.net!math.ohio-state.edu!cs.utexas.edu!swrinde!elroy.jpl.nasa.gov!decwrl!deccrl!news.crl.dec.com!stewart Fri Jul 16 22:15:58 EDT 1993
Article: 41359 of comp.arch
Newsgroups: comp.arch
Path: news.udel.edu!darwin.sura.net!math.ohio-state.edu!cs.utexas.edu!swrinde!elroy.jpl.nasa.gov!decwrl!deccrl!news.crl.dec.com!stewart
From: stewart@crl.dec.com (Larry Stewart)
Subject: Re: Looking for info. on DEC 3000 AXP 400
Message-ID: <1993Jul16.211105.6132@crl.dec.com>
Sender: news@crl.dec.com (USENET News System)
Reply-To: stewart@crl.dec.com
Organization: DEC Cambridge Research Lab
References: <1993Jul13.174722.10241@eecs.nwu.edu> <CA81Hu.IsD@ryn.mro4.dec.com> <1993Jul16.104143.27476@is.titech.ac.jp> <m4dh1bINNnee@exodus.Eng.Sun.COM>
Date: Fri, 16 Jul 1993 21:11:05 GMT
Lines: 98
Status: RO

In article <m4dh1bINNnee@exodus.Eng.Sun.COM>, tremblay@flayout.Eng.Sun.COM (Marc Tremblay) writes:
> In article <1993Jul16.104143.27476@is.titech.ac.jp> maeno@is.titech.ac.jp (Toshinori Maeno) writes:
> >  2. read miss penalty is 12 cycles for read, 40 cycles for write when
> >the data is only in the memory.
> 
> Maybe someone from DEC can explain the discrepancy between the 12 cycles
> claimed here and the 27 cycles that was claimed in a previous message
> for a second level read miss.
> 
> - Marc Tremblay.
> Sun Microsystems.

That's easy.  The 12 cycle number is wrong.  Dileep's message was accurate,
but it is pretty easy to draw wrong conclusions from the numbers, and quite
difficult to actually measure them.

What the hardware does (read):

	1 cycle 1st level cache access
	5 cycle second level cache access
	27 cycle main memory access

What the programmer sees:

The program keeps running after the LD is issued, and stalls only
if the destination register of the LD is touched before the data
gets there.  Consequently, in order to measure the load latency,
you have to do something like the following:

1) Perform a series of references to assure that the test reference
will be a hit or miss in the appropriate cache.

2) Do an MB to make sure the write buffers are flushed.

3) Let the pin bus become idle, to assure your test reference is
not stalled behind some other activity.

4) execute a test code sequence, consisting (typically)
of:
	RCC	; read cycle counter
	LD	; test load instruction
	ADD	; some instruction to touch the result register
	RCC	; to read the cycle counter again.

Of course it isn't that simple, since you have to know the align-
ment of the instructions, and which ones will dual-issue with which.

5) Correct the result of the measurement for the effects of the
RCC instructions.
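
The correction in step 5 can be illustrated in a high-level language,
with the caveat that this is only an analogy: Python's
time.perf_counter_ns stands in for the cycle counter, a list index
stands in for the test load, and interpreter overhead swamps any real
cache latency.  The function names are hypothetical.

```python
import time


def timer_overhead_ns(samples=1000):
    """Smallest observed gap between two back-to-back timer reads."""
    best = None
    for _ in range(samples):
        start = time.perf_counter_ns()
        stop = time.perf_counter_ns()
        gap = stop - start
        if best is None or gap < best:
            best = gap
    return best


def corrected_ns(raw_ns, overhead_ns):
    """Subtract the timer's own cost; clamp at zero to absorb noise."""
    return max(0, raw_ns - overhead_ns)


overhead = timer_overhead_ns()
data = list(range(1000))
start = time.perf_counter_ns()
value = data[500] + 1     # stand-in for the LD followed by the ADD
stop = time.perf_counter_ns()
print(corrected_ns(stop - start, overhead))
```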

In fact, the 21064 can have two outstanding loads.  The third load will stall,
and there are some other weird stall conditions; see the 21064 data book.

The answers given by this measurement are (I think) 3-cycle latency for
the first level cache  (load-use penalty) and 8 for the external cache,
because it takes a couple of cycles to get the pipes moving and to get
the address to the pins, before you can start the cache access.
The rep-rates are 1 cycle for the internal cache and 5 for the external.

Measuring store latency is even harder.  First you have to decide
what it means, since the 21064 will store up to 128 bytes of write data
in the write buffers before incurring ANY delays to the program.

One thing it might mean is how long does it take before the cells in
the DRAMS get new values.  Who cares?

Another thing it might mean is how long does it take to write a value
and then read it back.  (Pushing arguments and calling a procedure, which
pops them, might have performance limited by such a path.)
I <think> on the 3000/400 this path requires that the write data reach
the external cache and then can be read back in via a second level cache hit.
The performance of this path is on the order of 20 cycles, and can be
measured using the cycle counter as above.  Of course a good compiler
might pass arguments in registers, since reading back something you've
just written can be expensive on many modern machines.

Writes to the secondary cache take longer than reads because the chip
must make two cache accesses to do a write.  The first access is a read
to check the tag store and dirty bits.  The second access writes the data
and updates the dirty bit.  If the relevant write buffer entry contains
data from both halves of the 32 byte cache line, then the write will take
three 5-cycle cache accesses, because the pin bus is only 16 bytes wide.

So my view is that asking "what are the read and write latencies" is
interesting, but can be simplistic, since you cannot use the answers
to predict anything in particular.  You cannot add read-latencies to
calculate bandwidth, because the memory is delivering 32 bytes per access,
not just what you asked for.  In any case, the memory rep-rate is NOT
the same as the latency.  The situation for writes is clearer: write latency
is very nearly uninteresting, since it has almost nothing to do with
performance.  Write-bandwidth is interesting, and the fact that write
activity limits the bandwidth available for reads is interesting, but
who cares how long it takes?  (other than a programmed I/O device driver.)
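
The difference between latency and rep-rate can be made concrete with a
little arithmetic.  The sketch below is an illustration, not a measurement:
it assumes each 5-cycle external cache access moves 16 bytes (my reading of
the remark that the pin bus is 16 bytes wide), uses the 8-cycle external
cache latency quoted above, and borrows the 133 MHz clock mentioned
elsewhere in this thread.  The function names are hypothetical.

```c
#include <assert.h>

/* Sustained bytes per cycle: bytes moved per access divided by the
   repetition rate (cycles between successive accesses), not the latency. */
static double bytes_per_cycle(double bytes_per_access, double rep_rate_cycles) {
    assert(rep_rate_cycles > 0.0);
    return bytes_per_access / rep_rate_cycles;
}

/* Convert bytes per cycle to MB/s (10^6 bytes) at a given clock in MHz. */
static double mb_per_s(double bpc, double clock_mhz) {
    return bpc * clock_mhz;
}
```

With these assumptions, bytes_per_cycle(16, 5) is 3.2 bytes per cycle, or
about 426 MB/s at 133 MHz, whereas naively dividing one 8-byte load by the
8-cycle latency gives 1 byte per cycle, or 133 MB/s.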

-Larry Stewart
-- 
Digital Equipment Corporation
Cambridge Research Laboratory


From news.udel.edu!udel!gatech!swrinde!elroy.jpl.nasa.gov!ames!koriel!sh.wide!wnoc-tyo-news!cs.titech!is.titech!maeno Sat Jul 17 07:46:01 EDT 1993
Article: 41362 of comp.arch
Newsgroups: comp.arch
Path: news.udel.edu!udel!gatech!swrinde!elroy.jpl.nasa.gov!ames!koriel!sh.wide!wnoc-tyo-news!cs.titech!is.titech!maeno
From: maeno@is.titech.ac.jp (Toshinori Maeno)
Subject: Re: Looking for info. on DEC 3000 AXP 400
References: <CA81Hu.IsD@ryn.mro4.dec.com> 
    <1993Jul16.104143.27476@is.titech.ac.jp> <m4dh1bINNnee@exodus.Eng.Sun.COM>
Message-ID: <1993Jul17.024952.3137@is.titech.ac.jp>
Date: Sat, 17 Jul 1993 02:49:52 GMT
Organization: Dept. of Information Science, Tokyo Institute of Technology, 
    Tokyo, JAPAN
X-Bytes: 1186
Lines: 27
Status: RO

In article <m4dh1bINNnee@exodus.Eng.Sun.COM> tremblay@flayout.Eng.Sun.COM (Marc Tremblay) writes:
>In article <1993Jul16.104143.27476@is.titech.ac.jp> maeno@is.titech.ac.jp (Toshinori Maeno) writes:
>>  2. read miss penalty is 12 cycles for read, 40 cycles for write when
>>the data is only in the memory.
>
>At 133 MHz, 12 cycles represent 90ns. Given the size of main memory
>(the SPEC92 numbers were obtained on a machine with 128 MB), 90ns for
>the miss processing and for main memory latency would put tough
>constraints on the DRAMs. I suspect that 12 cycles are what is measured
>from address out to data in and does not include the overhead for
>the miss handling and bringing the data into the pipeline.
>
>Maybe someone from DEC can explain the discrepancy between the 12 cycles
>claimed here and the 27 cycles that was claimed in a previous message
>for a second level read miss.
>Sun Microsystems.

I am very sorry, I was confused and made a second mistake.

>>  2. read miss penalty is 12 cycles for read, 40 cycles for write when
                            == 
                            33 was the measured cycles.
>>the data is only in the memory.

Toshinori Maeno
  

From mccalpin@strauss.udel.edu Sat Sep 11 06:31:29 1993
Received: from strauss.udel.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA04603; Sat, 11 Sep 93 06:31:27 -0400
Return-Path: <mccalpin@strauss.udel.edu>
Received: from localhost (mccalpin@localhost) by strauss.udel.edu (8.5/8.5) id GAA05081; Sat, 11 Sep 1993 06:33:53 -0400
Date: Sat, 11 Sep 1993 06:33:53 -0400
From: John D McCalpin <mccalpin@strauss.udel.edu>
Message-Id: <199309111033.GAA05081@strauss.udel.edu>
To: mccalpin
Subject: Stream on Sun/2000
Content-Length: 1631
Status: RO

--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =     592.05609671772 hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   34.4371      0.9395      0.9292      0.9596
Scaling   :   36.6877      0.8909      0.8722      0.9106
Summing   :   35.2337      1.3788      1.3623      1.3948
SAXPYing  :   35.0374      1.3771      1.3700      1.3870
 Note: this program was linked with -fast or -fnonstd 
 and so may have produced nonstandard floating-point results. 
 Sun's implementation of IEEE arithmetic is discussed in 
 the Numerical Computation Guide.

real       52.57
user       48.14
sys         3.68

--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =     132.00789839029 hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   36.6677      0.2223      0.2182      0.2462
Scaling   :   34.6741      0.2327      0.2307      0.2355
Summing   :   37.3276      0.3243      0.3215      0.3299
SAXPYing  :   35.6657      0.3395      0.3365      0.3452

From mccalpin@cacr  Ukn Sep 20 10:02:11 1993
Received: from cacr.coastal.udel.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA02591; Mon, 20 Sep 93 10:02:10 -0400
Return-Path: <mccalpin@cacr>
Received: by cacr (920330.SGI/920502.SGI.AUTO)
	for mccalpin@perelandra.cms.udel.edu id AA10035; Mon, 20 Sep 93 10:10:28 -0400
Date: Mon, 20 Sep 93 10:10:28 -0400
From: mccalpin@cacr (John D. McCalpin)
Message-Id: <9309201410.AA10035@cacr>
To: mccalpin
Subject: stream_d on Indigo R4000
Status: RO
X-Status: 

make stream_d
        f77 -O -mips2 stream_d.f  -o stream_d
 /usr/bin/timex stream_d
--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =    94.99999694526196     hundredths of a second
 Increase the size of the arrays if this is <30
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   55.1724      0.3442      0.2900      0.5200
Scaling   :   53.3333      0.3219      0.3000      0.3800
Summing   :   52.1740      0.5025      0.4600      0.5800
SAXPYing  :   53.3334      0.4926      0.4500      0.5700

real       22.82
user       15.22
sys         2.31

From news.udel.edu!udel!wupost!howland.reston.ans.net!pipex!uknet!pavo.csi.cam.ac.uk!cast0.ast.cam.ac.uk!drtr Fri Oct  8 10:35:29 EDT 1993
Article: 43061 of comp.arch
Newsgroups: comp.arch
Path: news.udel.edu!udel!wupost!howland.reston.ans.net!pipex!uknet!pavo.csi.cam.ac.uk!cast0.ast.cam.ac.uk!drtr
From: drtr@mail.ast.cam.ac.uk (David Robinson)
Subject: Re: SPARCstation memory performance???
Message-ID: <1993Oct8.094903.4366@infodev.cam.ac.uk>
Sender: news@infodev.cam.ac.uk (USENET news)
Nntp-Posting-Host: coral.ast.cam.ac.uk
Organization: Institute of Astronomy, Cambridge
References: <JHOE.93Oct7234827@au-bon-pain.lcs.mit.edu>
Date: Fri, 8 Oct 1993 09:49:03 GMT
Lines: 51
Status: RO

In article <JHOE.93Oct7234827@au-bon-pain.lcs.mit.edu>, jhoe@au-bon-pain.lcs.mit.edu (James C. Hoe) writes:
|> 
|> Has anyone done any experiments on, or have detailed knowledge of, memory
|> performance on SPARC workstations (4/4*0, SS10)?  I am particularly
|> interested in loads and stores on cache misses.
|> 
|> I have done some experiments and found my 40MHz SPARCstation2 uses
|> nearly 30 cycles on a load miss (I was told this was due to 
|> very slow memory translation).  SPARCstation2 also requires 8 cycles
|> to store regardless of miss or hit.  
|> 
|> Has anyone done similar experiments, or does anyone know the detailed inner
|> workings of these workstations, especially SS10's?  I would appreciate any
|> correspondence.
|> correspondence.

For a SS10 with MCC and secondary cache (10/41, 10/51 etc) the penalties are
Load miss primary cache, hit secondary cache:     5 cycles
Load miss primary cache, miss secondary cache:  ~80 cycles

For a SS10 without MCC, specifically 10/40, the penalty is
Load miss primary cache                         ~15 cycles

I suspect the penalty will be less for a 10/30, as cycles are longer.

Store penalties are harder to measure, and less useful to know, as there is
a four level store buffer. However, I have measured the peak store bandwidth:

Timings done using the std instruction.
All bandwidths in MB/s (megabytes per second)
I = on-chip (internal) or primary cache,
E = off-chip (external) or secondary cache
Clock speed given either as CPU speed  or CPU speed/memory bus speed

        Hit I cache   Miss I cache     Miss I cache
                      Hit E cache      Miss E cache
10/40     308             -              46          40MHz, I=16Kb, E=0Kb
10/51     220            212             34          50MHz/40MHz, I=16Kb, E=1Mb
10/41     182            172             31        40.3MHz/40Mhz, I=16Kb, E=1Mb
Classic    62             -              64          50Mhz, I=2kb, E=0Kb
SS2         -             25             24          40MHz, I=0Kb, E=64Kb

It is obvious that the 10/41 & 10/51 have a write-through internal cache,
whereas the 10/40 has a write-back internal cache.
For a 10/40, a store that hits the internal cache can be issued every cycle.
For a 10/41 or 51, once the store buffer is filled, stores can be executed
every 1 3/4 cycles, on average.
For the SS2, 24MB/s translates to 13.0 cycles for each std, or 6.5 cycles per
word, so either storing is pipelined, or 8 cycles only applies to single word
stores.
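
The conversion from store bandwidth to cycles per store can be written out
as a small C helper.  This is a sketch under two assumptions: a megabyte is
taken as 10^6 bytes, and each std stores 8 bytes (one SPARC doubleword).
The function name is hypothetical.

```c
#include <assert.h>

/* Average CPU cycles per store, given the clock in MHz, the store width
   in bytes, and the measured store bandwidth in MB/s (10^6 bytes/s). */
static double cycles_per_store(double clock_mhz, double bytes_per_store,
                               double bandwidth_mb_s) {
    assert(bytes_per_store > 0.0 && bandwidth_mb_s > 0.0);
    double mstores_per_s = bandwidth_mb_s / bytes_per_store; /* 10^6 stores/s */
    return clock_mhz / mstores_per_s;
}
```

With these assumptions, cycles_per_store(40.3, 8, 182) comes to about 1.77
for the 10/41, in line with the 1 3/4 cycles quoted above, and
cycles_per_store(40, 8, 24) comes to about 13.3 for the SS2.  Counting a
megabyte as 2^20 bytes instead would give about 12.7, so the 13.0 in the
text sits between the two conventions.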

David Robinson. (drtr@mail.ast.cam.ac.uk)


From rothberg@SSD.intel.com  Ukn Oct 14 16:58:35 1993
Received: from brahms.udel.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA24426; Thu, 14 Oct 93 16:58:33 -0400
Return-Path: <rothberg@SSD.intel.com>
Received: from SSD.intel.com (ssd.intel.com [137.46.201.30]) by brahms.udel.edu (8.6.beta.11/8.6.beta.2) with SMTP id QAA10316 for <mccalpin@brahms.udel.edu>; Thu, 14 Oct 1993 16:56:30 -0400
From: rothberg@SSD.intel.com
Received: from warthog.ssd.intel.com by SSD.intel.com (4.1/SMI-4.1)
	id AA07184; Thu, 14 Oct 93 13:56:28 PDT
Message-Id: <9310142056.AA07184@SSD.intel.com>
To: mccalpin@brahms.udel.edu
Subject: STREAM benchmark
Date: Thu, 14 Oct 93 13:56:27 -0700
Status: RO
X-Status: 


I saw the table you posted a few weeks ago in comp.arch, comparing
achieved memory bandwidths for several machines.  Very interesting
numbers.  I was wondering whether you've got any data on the new IBM
POWER2 machines.  Also, is there any chance I could get a copy of the
codes you used?  I'd like to try running the test on a few other
machines.  I realize that they are all just 3-line programs, but it
makes direct comparison easier if I know that I'm using the same
source.

Thanks,

Ed Rothberg
Intel Supercomputer Systems Division
rothberg@ssd.intel.com


From Renu.Raman@Eng.Sun.COM Tue Oct 26 20:54:12 1993
Received: from bach.udel.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA04709; Tue, 26 Oct 93 20:54:11 -0400
Return-Path: <Renu.Raman@Eng.Sun.COM>
Received: from Sun.COM (Sun.COM [192.9.9.1]) by bach.udel.edu (8.6.beta.11/8.6.beta.2) with SMTP id UAA04876 for <mccalpin@bach.udel.edu>; Tue, 26 Oct 1993 20:52:44 -0400
Received: from Eng.Sun.COM (zigzag.Eng.Sun.COM) by Sun.COM (4.1/SMI-4.1)
	id AA09136; Tue, 26 Oct 93 17:52:27 PDT
Received: from shukra.Eng.Sun.COM by Eng.Sun.COM (4.1/SMI-4.1)
	id AA10081; Tue, 26 Oct 93 17:52:12 PDT
Received: by shukra.Eng.Sun.COM (4.1/SMI-4.1)
	id AA05341; Tue, 26 Oct 93 17:52:25 PDT
Date: Tue, 26 Oct 93 17:52:25 PDT
From: Renu.Raman@Eng.Sun.COM (Renu Raman)
Message-Id: <9310270052.AA05341@shukra.Eng.Sun.COM>
To: mccalpin@bach.udel.edu
Subject: Re: IBM RS/6000 or HP Apollo 9000: which to buy?
Newsgroups: comp.benchmarks
In-Reply-To: <GEOMAGIC.93Oct23222209@moe.seismo.do.usbr.gov>
References: <2a7utr$lf5@nh1.u-aizu.ac.jp>
Organization: Sun
Cc: geomagic@seismo.do.usbr.gov
Status: RO

Update of SS10/41
 
>                    Bytes          Bandwidth (MB/s)               
>Machine             /word      Copy     Scale       Sum     Triad 
>---------------     -----  --------  --------  --------  -------- 
>Sun SS10/41             4      34.3      38.4      36.9      37.9 
>Sun SS10/30             8      42.1      46.2      46.2      46.2 
>Sun SS10/30             4      33.9      41.7      39.0      34.1 

Sun SS10/41              8      48.0      48.0      54.0      54.0
Sun SS10/512             8      48.0      48.0      48.0      48.0

Obviously, as the memory system does not scale, it's about the same....
These numbers were obtained using the Apogee compilers...

renu raman

From hahn@neurocog.lrdc.pitt.edu  Ukn Oct 27 12:03:18 1993
Received: from neurocog.lrdc.pitt.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA05504; Wed, 27 Oct 93 12:03:16 -0400
Return-Path: <hahn@neurocog.lrdc.pitt.edu>
Message-Id: <9310271603.AA05504@perelandra.cms.udel.edu>
Received: by neurocog.lrdc.pitt.edu
	(1.37.109.4/16.2) id AA29337; Wed, 27 Oct 93 12:01:59 -0400
From: Mark Hahn <hahn@neurocog.lrdc.pitt.edu>
Subject: stream in C
To: mccalpin
Date: Wed, 27 Oct 1993 12:01:58 -0500 (EDT)
Cc: hahn@neurocog.lrdc.pitt.edu (Mark Hahn, lrdc 512, 6247063, 3633618)
X-Mailer: ELM [version 2.4 PL21]
Content-Type: text
Content-Length: 4344      
Status: RO
X-Status: 

I had no luck finding the correct time/clock function for the
little-used fortran on this machine, an hp 735, so I translated
your stream.f into reasonably portable C.  I'd appreciate it if
you would look it over and verify that it's doing something 
comparable to the fortran version.  my intent was to post the
c version on comp.benchmarks.  BTW, with this code, our 735 gets
about 73 mb/s double saxpy.

/*
* Program: Stream
* Programmer: John D. McCalpin
* Revision: 2.0, September 30,1991
*
* This program measures memory transfer rates in MB/s for simple 
* computational kernels coded in Fortran.  These numbers reveal the
* quality of code generation for simple uncacheable kernels as well
* as showing the cost of floating-point operations relative to memory
* accesses.
*
* INSTRUCTIONS:
*	1) (fortran-specific, omitted.)
*	2) Stream requires a good bit of memory to run.
*	   Adjust the Parameter 'N' in the second line of the main
*	   program to give a 'timing calibration' of at least 20 clicks.
*	   This will provide rate estimates that should be good to 
*	   about 5% precision.
*	3) Compile the code with full optimization.  Many compilers
*	   generate unreasonably bad code before the optimizer tightens
*	   things up.  If the results are unreasonably good, on the
*	   other hand, the optimizer might be too smart for me!
*	4) Mail the results to mccalpin@perelandra.cms.udel.edu
*	   Be sure to include:
*		a) computer hardware model number and software revision
*		b) the compiler flags
*		c) all of the output from the test case.
* Thanks!
*
* this version was ported from fortran to c by mark hahn, hahn+@pitt.edu.
*/

#define N 1000000
#define NTIMES 10

#ifdef __hpux
#define _HPUX_SOURCE 1
#else
#define _INCLUDE_POSIX_SOURCE 1
#endif
#include <float.h>	/* FLT_MAX */
#include <limits.h>
#include <sys/time.h>
#include <math.h>
#include <stdio.h>

#ifndef MIN
#define MIN(x,y) ((x)<(y)?(x):(y))
#endif
#ifndef MAX
#define MAX(x,y) ((x)>(y)?(x):(y))
#endif

struct timeval tvStart;

void utimeStart() {
    struct timezone tz;
    gettimeofday(&tvStart,&tz);
}

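/* Elapsed wall-clock time in microseconds since utimeStart(). */
float utime() {
    struct timeval tv;
    struct timezone tz;
    float utime;
    gettimeofday(&tv,&tz);
    /* The signed usec difference already accounts for a borrow from the
       seconds field, so no extra 1e6 correction is needed. */
    utime = 1e6 * (tv.tv_sec - tvStart.tv_sec) + (tv.tv_usec - tvStart.tv_usec);
    return utime;
}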

typedef double real;
static real a[N],b[N],c[N];

int main() {
    int j,k;
    float times[4][NTIMES];
    static float rmstime[4] = {0};
    static float mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX};
    static float maxtime[4] = {0};
    static char *label[4] = {"Assignment:",
			     "Scaling   :",
			     "Summing   :",
			     "SAXPYing  :"};
    static float bytes[4] = { 2 * sizeof(real) * N,
			      2 * sizeof(real) * N,
			      3 * sizeof(real) * N,
			      3 * sizeof(real) * N};

    /* --- SETUP --- determine precision and check timing --- */
    utimeStart();
    for (j=0; j<N; j++) {
	a[j] = 1.0;
	b[j] = 2.0;
	c[j] = 0.0;
    }
    printf("Timing calibration ; time = %f usec.\n",utime());
    printf("Increase the size of the arrays if this is < 300000\n"
	   "and your clock precision is =< 1/100 second.\n");
    printf("---------------------------------------------------\n");
    
    /*	--- MAIN LOOP --- repeat test cases NTIMES times --- */
    for (k=0; k<NTIMES; k++) {
	utimeStart();
	for (j=0; j<N; j++)
	    c[j] = a[j];
	times[0][k] = utime();
	
	utimeStart();
	for (j=0; j<N; j++)
	    c[j] = 3.0e0*a[j];
	times[1][k] = utime();
	
	utimeStart();
	for (j=0; j<N; j++)
	    c[j] = a[j]+b[j];
	times[2][k] = utime();
	
	utimeStart();
	for (j=0; j<N; j++)
	    c[j] = a[j]+3.0e0*b[j];
	times[3][k] = utime();
    }
    
    /*	--- SUMMARY --- */
    for (k=0; k<NTIMES; k++) {
	for (j=0; j<4; j++) {
	    rmstime[j] = rmstime[j] + (times[j][k] * times[j][k]);
	    mintime[j] = MIN(mintime[j], times[j][k]);
	    maxtime[j] = MAX(maxtime[j], times[j][k]);
	}
    }
    
    printf("Function Rate   (MB/s)   RMS time     Min time     Max time\n");
    for (j=0; j<4; j++) {
	rmstime[j] = sqrt(rmstime[j]/(float)NTIMES);

	printf("%s%11.3f  %11.3f  %11.3f  %11.3f\n",
	       label[j],
	       bytes[j]/mintime[j],
	       rmstime[j],
	       mintime[j],
	       maxtime[j]);
    }
    return 0;
}

thanks, mark hahn.
-- 
this space intentionally left non-blank.	hahn@neurocog.lrdc.pitt.edu

From hahn@neurocog.lrdc.pitt.edu  Ukn Oct 27 12:35:23 1993
Received: from neurocog.lrdc.pitt.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA05615; Wed, 27 Oct 93 12:35:22 -0400
Return-Path: <hahn@neurocog.lrdc.pitt.edu>
Message-Id: <9310271635.AA05615@perelandra.cms.udel.edu>
Received: by neurocog.lrdc.pitt.edu
	(1.37.109.4/16.2) id AA29898; Wed, 27 Oct 93 12:34:05 -0400
From: Mark Hahn <hahn@neurocog.lrdc.pitt.edu>
Subject: Re: stream in C
To: mccalpin (John D. McCalpin)
Date: Wed, 27 Oct 1993 12:34:05 -0500 (EDT)
In-Reply-To: <9310271627.AA05590@perelandra.cms.udel.edu> from "John D. McCalpin" at Oct 27, 93 12:27:59 pm
X-Mailer: ELM [version 2.4 PL21]
Content-Type: text
Content-Length: 635       
Status: RO
X-Status: 

> There is one significant difference -- your code measures the elapsed
> time, while the Fortran version measures the cpu time.  If your machine
> is almost idle, then the *minimum* elapsed time is not a bad measure
> of the cpu time, otherwise it is nearly impossible to compare.
> 
> I thought that HP had an etime(dummy) function available from Fortran.
> Did you look for that one?
> 
> Alternatively, you can use what most folks use, provided that you can
> figure out how to link C and Fortran....
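
The distinction between elapsed time and cpu time can be shown from C
alone.  The following is a minimal sketch, not the timer either version of
the benchmark uses: it takes gettimeofday() for elapsed time and the ISO C
clock() call for cpu time, and the function names are hypothetical.  On an
idle machine the two figures should be close; on a loaded one the elapsed
time will be larger.

```c
#include <assert.h>
#include <sys/time.h>
#include <time.h>

/* Wall-clock (elapsed) seconds since the epoch. */
static double elapsed_seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

/* CPU seconds consumed so far by this process. */
static double cpu_seconds(void) {
    clock_t c = clock();
    assert(c != (clock_t)-1);
    return (double)c / CLOCKS_PER_SEC;
}

/* Run a simple copy kernel and report both timings in seconds. */
static void time_copy(double *dst, const double *src, long n,
                      double *elapsed, double *cpu) {
    double e0 = elapsed_seconds();
    double c0 = cpu_seconds();
    for (long j = 0; j < n; j++)
        dst[j] = src[j];
    *cpu = cpu_seconds() - c0;
    *elapsed = elapsed_seconds() - e0;
}
```

Reporting both numbers side by side makes it easy to see whether a run was
disturbed by other load on the machine.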


good comments, I'll try all three.

thanks, mark hahn.
-- 
this space intentionally left non-blank.	hahn@neurocog.lrdc.pitt.edu

From Renu.Raman@Eng.Sun.COM  Ukn Oct 27 14:34:43 1993
Received: from Sun.COM by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA06233; Wed, 27 Oct 93 14:34:26 -0400
Return-Path: <Renu.Raman@Eng.Sun.COM>
Received: from Eng.Sun.COM (zigzag.Eng.Sun.COM) by Sun.COM (4.1/SMI-4.1)
	id AA21857; Wed, 27 Oct 93 11:32:15 PDT
Received: from shukra.Eng.Sun.COM by Eng.Sun.COM (4.1/SMI-4.1)
	id AA21259; Wed, 27 Oct 93 11:31:16 PDT
Received: by shukra.Eng.Sun.COM (4.1/SMI-4.1)
	id AA06850; Wed, 27 Oct 93 11:31:33 PDT
Date: Wed, 27 Oct 93 11:31:33 PDT
From: Renu.Raman@Eng.Sun.COM (Renu Raman)
Message-Id: <9310271831.AA06850@shukra.Eng.Sun.COM>
To: mccalpin
Subject: Re: IBM RS/6000 or HP Apollo 9000: which to buy?
Status: RO
X-Status: 


>From mccalpin@perelandra.cms.udel.edu Wed Oct 27 04:50:39 1993
48.0
>
>Could you please send the raw output of the tests?
>I am trying to keep very complete records for all the entries...
>
>Thanks!


Here are the details

SparcClassic (uSPARC 50MhZ)
--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =    101.6666617244482     hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   57.6001      0.0937      0.0833      0.1000
Scaling   :   48.0000      0.1122      0.1000      0.1333
Summing   :   48.0000      0.1652      0.1500      0.1833
SAXPYing  :   43.2000      0.1852      0.1667      0.2000
*******************************

SS10/41 Without E$ and 128MB of memory

--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =    64.99999836087227     hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   48.0000      0.1087      0.1000      0.1167
Scaling   :   48.0000      0.1070      0.1000      0.1167
Summing   :   54.0001      0.1402      0.1333      0.1500
SAXPYing  :   54.0001      0.1385      0.1333      0.1500
***************************************************

SS10/512 - dual Vikings@50MHZ with E$ (1MB)
--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =    24.00000095367432     hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   48.0000      0.1031      0.1000      0.1100
Scaling   :   48.0000      0.1051      0.1000      0.1100
Summing   :   48.0000      0.1561      0.1500      0.1600
SAXPYing  :   48.0000      0.1561      0.1500      0.1600
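
The Rate column can be reproduced from the Min time column.  The sketch
below follows the byte counts used in the C port of the benchmark (two
arrays touched for Assignment and Scaling, three for Summing and SAXPYing)
and divides by the minimum time.  The array length of 300000 words used in
the example is an inference from the numbers above, not something stated in
the message, and the function name is hypothetical.

```c
#include <assert.h>

/* STREAM-style rate in MB/s (10^6 bytes): bytes moved per pass divided by
   the minimum time.  arrays_touched is 2 for Assignment and Scaling,
   3 for Summing and SAXPYing. */
static double stream_rate_mb_s(int arrays_touched, int bytes_per_word,
                               long n, double min_seconds) {
    assert(min_seconds > 0.0);
    double bytes = (double)arrays_touched * bytes_per_word * (double)n;
    return bytes / min_seconds / 1e6;
}
```

Under that assumption, stream_rate_mb_s(2, 8, 300000, 0.1000) and
stream_rate_mb_s(3, 8, 300000, 0.1500) both give 48 MB/s.  Because the
minimum times are only a few clock ticks long, the rate can take only a
handful of distinct values, which would explain why the same figure repeats
across rows.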


renu raman

From bt@irfu.se  Ukn Oct 27 18:01:29 1993
Received: from irfu.irfu.se by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA06568; Wed, 27 Oct 93 18:01:22 -0400
Return-Path: <bt@irfu.se>
Received: from abba.irfu.se by irfu.irfu.se with SMTP
	(16.6/15.6) id AA04121; Wed, 27 Oct 93 22:59:25 +0100
Received: by abba.irfu.se
	(1.37.109.4/15.6) id AA09035; Wed, 27 Oct 93 22:58:31 +0100
From: Bo Thide' <bt@irfu.se>
Message-Id: <9310272158.AA09035@abba.irfu.se>
Subject: Streams rsults for HP9000/720 and 735
To: mccalpin
Date: Wed, 27 Oct 93 22:58:31 MET
Mailer: Elm [revision: 70.85]
Status: RO
X-Status: 

Hi John,

Just wanted to send you the results I obtained for stream on some
of our HP's:


HP9000/720:
--------------------------------------
 Single precision appears to have  7 digits of accuracy
 Assuming 4 bytes per default REAL word
--------------------------------------
 Timing calibration ; time =  22.99999 hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   45.7143       .0762       .0700       .0800  
Scaling   :   53.3334       .0662       .0600       .0700  
Summing   :   48.0000       .1090       .1000       .1100  
SAXPYing  :   53.3334       .0921       .0900       .1000  

--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =  34.00000147521495 hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   53.3334       .1011       .0900       .1100  
Scaling   :   53.3334       .0991       .0900       .1100  
Summing   :   55.3847       .1361       .1300       .1400  
SAXPYing  :   55.3847       .1381       .1300       .1400  


HP9000/735:
--------------------------------------
 Single precision appears to have  7 digits of accuracy
 Assuming 4 bytes per default REAL word
--------------------------------------
 Timing calibration ; time =  14.0 hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   80.0000       .0493       .0400       .0600  
Scaling   :  106.6668       .0486       .0300       .0600  
Summing   :   80.0001       .0663       .0600       .0800  
SAXPYing  :   95.9999       .0687       .0500       .0800  

--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =  22.00000043958425 hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   68.5715       .0752       .0700       .0800  
Scaling   :   80.0001       .0743       .0600       .0800  
Summing   :   80.0001       .0993       .0900       .1100  
SAXPYing  :   90.0001       .0972       .0800       .1000  


I leave it to you to calculate the MFLOPS from these values.  BTW, I
used HP-UX 9.01 f77 with less than full optimization.  With max
optimization, dead-code elimination gave ridiculous results (infinite
speed).
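
One common way to keep an optimizer from deleting a benchmark kernel is to
make the program's output depend on the array the kernel writes.  The
message does not say what the compiler removed, so this is a sketch under
the assumption that the stores were discarded because nothing read them
afterwards.  It is written in C rather than Fortran, and the function name
is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Scale kernel followed by a checksum of the output array.  Returning a
   value that depends on every element of c gives the optimizer a reason
   to keep the stores. */
static double scale_and_checksum(double *c, const double *a, size_t n,
                                 double factor) {
    assert(c != NULL && a != NULL);
    for (size_t j = 0; j < n; j++)
        c[j] = factor * a[j];
    double sum = 0.0;
    for (size_t j = 0; j < n; j++)
        sum += c[j];
    return sum;
}
```

In a timed run the clock would be read between the two loops, and the
checksum printed after the timings so that it does not count toward the
measured interval.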

Bo

From hahn@neurocog.lrdc.pitt.edu Thu Oct 28 08:02:01 EDT 1993
Article: 12912 of comp.benchmarks
Path: news.udel.edu!darwin.sura.net!math.ohio-state.edu!cs.utexas.edu!uunet!pitt.edu!neurocog.lrdc.pitt.edu!hahn
From: hahn@neurocog.lrdc.pitt.edu (Mark Hahn)
Newsgroups: comp.benchmarks
Subject: Re: IBM RS/6000 or HP Apollo 9000: which to buy?
Message-ID: <5316@blue.cis.pitt.edu>
Date: 28 Oct 93 02:21:05 GMT
References: <2a7utr$lf5@nh1.u-aizu.ac.jp> <GEOMAGIC.93Oct23222209@moe.seismo.do.usbr.gov> <5190@blue.cis.pitt.edu> <schumach.751684941@convex.convex.com>
Sender: news+@pitt.edu
Lines: 160
X-Newsreader: TIN [version 1.2 PL2]
Status: RO

appended to this message is a fairly portable C translation of stream.f.
on our hp735 and "cc +P +O3 -J +Om1 -Wl,-a,archive", I get these results:

Timing calibration ; time = 760.00 usec.
Increase the size of the arrays if this is < 300
and your clock precision is =< 1/100 second.
---------------------------------------------------
Function Rate   (MB/s)     RMS time    Min time    Max time
Assignment:     69.837     247.083     240.000     260.000
Scaling   :     69.837     246.049     240.000     250.000
Summing   :     71.832     351.013     350.000     360.000
SAXPYing  :     73.945     350.143     340.000     370.000

The code is also available for anon ftp from 
neurocog.lrdc.pitt.edu:pub/cstream.c

/*
* Program: Stream
* Programmer: John D. McCalpin
* Revision: 2.0, September 30,1991
*
* This program measures memory transfer rates in MB/s for simple 
* computational kernels coded in Fortran.  These numbers reveal the
* quality of code generation for simple uncacheable kernels as well
* as showing the cost of floating-point operations relative to memory
* accesses.
*
* INSTRUCTIONS:
*	1) (fortran-specific, omitted.)
*	2) Stream requires a good bit of memory to run.
*	   Adjust the Parameter 'N' in the second line of the main
*	   program to give a 'timing calibration' of at least 20 clicks.
*	   This will provide rate estimates that should be good to 
*	   about 5% precision.
*	3) Compile the code with full optimization.  Many compilers
*	   generate unreasonably bad code before the optimizer tightens
*	   things up.  If the results are unreasonably good, on the
*	   other hand, the optimizer might be too smart for me!
*	4) Mail the results to mccalpin@perelandra.cms.udel.edu
*	   Be sure to include:
*		a) computer hardware model number and software revision
*		b) the compiler flags
*		c) all of the output from the test case.
*
* Thanks!
*
* This version was ported from the fortran by Mark Hahn, hahn+@pitt.edu.
*/

#define N (1023*1024)
#define NTIMES 10

#define _HPUX_SOURCE 1
#define _POSIX_SOURCE 1
#define _XOPEN_SOURCE 1
#define _INCLUDE_POSIX_SOURCE 1

#include <float.h>	/* FLT_MAX */
#include <limits.h>
#include <time.h>
#include <sys/times.h>
#include <math.h>
#include <stdio.h>

#ifndef MIN
#define MIN(x,y) ((x)<(y)?(x):(y))
#define MAX(x,y) ((x)>(y)?(x):(y))
#endif

struct tms tmsStart;

void mtimeStart() {
    times(&tmsStart);
}

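/* CPU time (user + system) in milliseconds since mtimeStart().
   Note that the calibration printf below labels this value "usec". */
float mtime() {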
    struct tms t;
    times(&t);
    return 1e3 * (float) ((t.tms_stime - tmsStart.tms_stime) + 
	     (t.tms_utime - tmsStart.tms_utime)) / (float) CLK_TCK;
}

typedef double real;
static real a[N],b[N],c[N];

int main() {
    int j,k;
    float times[4][NTIMES];
    static float rmstime[4] = {0};
    static float mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX};
    static float maxtime[4] = {0};
    static char *label[4] = {"Assignment:",
			     "Scaling   :",
			     "Summing   :",
			     "SAXPYing  :"};
    static float bytes[4] = { 2 * sizeof(real) * N,
			      2 * sizeof(real) * N,
			      3 * sizeof(real) * N,
			      3 * sizeof(real) * N};

    /* --- SETUP --- determine precision and check timing --- */
    mtimeStart();
    for (j=0; j<N; j++) {
	a[j] = 1.0;
	b[j] = 2.0;
	c[j] = 0.0;
    }
    printf("Timing calibration ; time = %.2f usec.\n",mtime());
    printf("Increase the size of the arrays if this is < 300\n"
	   "and your clock precision is =< 1/100 second.\n");
    printf("---------------------------------------------------\n");
    
    /*	--- MAIN LOOP --- repeat test cases NTIMES times --- */
    for (k=0; k<NTIMES; k++) {
	mtimeStart();
	for (j=0; j<N; j++)
	    c[j] = a[j];
	times[0][k] = mtime();
	
	mtimeStart();
	for (j=0; j<N; j++)
	    c[j] = 3.0e0*a[j];
	times[1][k] = mtime();
	
	mtimeStart();
	for (j=0; j<N; j++)
	    c[j] = a[j]+b[j];
	times[2][k] = mtime();
	
	mtimeStart();
	for (j=0; j<N; j++)
	    c[j] = a[j]+3.0e0*b[j];
	times[3][k] = mtime();
    }
    
    /*	--- SUMMARY --- */
    for (k=0; k<NTIMES; k++) {
	for (j=0; j<4; j++) {
	    rmstime[j] = rmstime[j] + (times[j][k] * times[j][k]);
	    mintime[j] = MIN(mintime[j], times[j][k]);
	    maxtime[j] = MAX(maxtime[j], times[j][k]);
	}
    }
    
    printf("Function Rate   (MB/s)     RMS time    Min time    Max time\n");
    for (j=0; j<4; j++) {
	rmstime[j] = sqrt(rmstime[j]/(float)NTIMES);

	printf("%s%11.3f %11.3f %11.3f %11.3f\n",
	       label[j],
	       bytes[j]/mintime[j]/1e3,
	       rmstime[j],
	       mintime[j],
	       maxtime[j]);
    }
    return 0;
}

regards, mark hahn.
--
this space intentionally left non-blank.	hahn@neurocog.lrdc.pitt.edu


From bt@irfu.se  Ukn Oct 28 13:55:10 1993
Received: from irfu.irfu.se by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA09375; Thu, 28 Oct 93 13:53:03 -0400
Return-Path: <bt@irfu.se>
Received: from hybrid.irfu.se by irfu.irfu.se with SMTP
	(16.6/15.6) id AA29666; Thu, 28 Oct 93 18:50:39 +0100
Received: by hybrid.irfu.se
	(1.37.109.4/16.2) id AA00268; Thu, 28 Oct 93 18:51:00 +0100
From: Bo Thide' <bt@irfu.se>
Message-Id: <9310281751.AA00268@hybrid.irfu.se>
Subject: Re: Streams rsults for HP9000/720 and 735
To: mccalpin (John D. McCalpin)
Date: Thu, 28 Oct 1993 18:51:00 +0100 (MET)
In-Reply-To: <Pine.3.07.9310281048.B7659-9100000@perelandra.cms.udel.edu> from "John D. McCalpin" at Oct 28, 93 10:09:53 am
Reply-To: bt@irfu.se
Organization: Swedish Institute of Space Physics, S-755 91 Uppsala, Sweden
System: HP-UX A.09.01 9000/720
X-Mailer: ELM [version 2.4 PL23]
Content-Type: text
Content-Length: 874       
Status: RO
X-Status: 

You (John D. McCalpin) write:
> 
> Thanks for the results....
> 
> It looks like the runs were too short for accurate measurements,
> assuming that the clock is .01 second resolution.  Several of the later
> tests are only 3-4 clock ticks.  They need to be about 10 times longer to
> get accurate measurements....

But how can I do that?  I have already upped the n (= array size) to
the limit my present kernel can accept.  Is it acceptable to instead
increase ntimes from 10 to 100?

Bo

-- 
   ^   Bo Thide'---------------------------------------------Science Director-
  |I|        Swedish Institute of Space Physics, S-755 91 Uppsala, Sweden
  |R|    Phone: (+46) 18-303671.  Fax: (+46) 18-403100.  IP: 130.238.30.23
 /|F|\          INTERNET: bt@irfu.se      UUCP: ...!mcvax!sunic!irfu!bt  
 ~~U~~ ----------------------------------------------------------------sm5dfw-

From bt@irfu.se  Ukn Oct 28 14:14:13 1993
Received: from irfu.irfu.se by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA09436; Thu, 28 Oct 93 14:14:10 -0400
Return-Path: <bt@irfu.se>
Received: from hybrid.irfu.se by irfu.irfu.se with SMTP
	(16.6/15.6) id AA29892; Thu, 28 Oct 93 19:12:19 +0100
Received: by hybrid.irfu.se
	(1.37.109.4/16.2) id AA00493; Thu, 28 Oct 93 19:12:40 +0100
From: Bo Thide' <bt@irfu.se>
Message-Id: <9310281812.AA00493@hybrid.irfu.se>
Subject: Re: Streams rsults for HP9000/720 and 735
To: mccalpin (John D. McCalpin)
Date: Thu, 28 Oct 1993 19:12:39 +0100 (MET)
In-Reply-To: <Pine.3.07.9310281048.B7659-9100000@perelandra.cms.udel.edu> from "John D. McCalpin" at Oct 28, 93 10:09:53 am
Reply-To: bt@irfu.se
Organization: Swedish Institute of Space Physics, S-755 91 Uppsala, Sweden
System: HP-UX A.09.01 9000/720
X-Mailer: ELM [version 2.4 PL23]
Content-Type: text
Content-Length: 588       
Status: RO
X-Status: 

Hi again,

I have had a closer look at the code and see that one must either
increase n or use another timing technique.  I now see that the ntimes
parameter does not affect the clock granularity.


Bo

-- 
   ^   Bo Thide'---------------------------------------------Science Director-
  |I|        Swedish Institute of Space Physics, S-755 91 Uppsala, Sweden
  |R|    Phone: (+46) 18-303671.  Fax: (+46) 18-403100.  IP: 130.238.30.23
 /|F|\          INTERNET: bt@irfu.se      UUCP: ...!mcvax!sunic!irfu!bt  
 ~~U~~ ----------------------------------------------------------------sm5dfw-

From bt@irfu.se Thu Oct 28 16:44:37 1993
Received: from irfu.irfu.se by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA10285; Thu, 28 Oct 93 16:44:33 -0400
Return-Path: <bt@irfu.se>
Received: from hybrid.irfu.se by irfu.irfu.se with SMTP
	(16.6/15.6) id AA01167; Thu, 28 Oct 93 21:42:41 +0100
Received: by hybrid.irfu.se
	(1.37.109.4/16.2) id AA01831; Thu, 28 Oct 93 21:43:02 +0100
From: Bo Thide' <bt@irfu.se>
Message-Id: <9310282043.AA01831@hybrid.irfu.se>
Subject: Re: Streams rsults for HP9000/720 and 735
To: mccalpin (John D. McCalpin)
Date: Thu, 28 Oct 1993 21:43:02 +0100 (MET)
In-Reply-To: <Pine.3.07.9310281503.C10167-a100000@perelandra.cms.udel.edu> from "John D. McCalpin" at Oct 28, 93 03:50:04 pm
Reply-To: bt@irfu.se
Organization: Swedish Institute of Space Physics, S-755 91 Uppsala, Sweden
System: HP-UX A.09.01 9000/720
X-Mailer: ELM [version 2.4 PL23]
Content-Type: text
Content-Length: 2621      
Status: RO

You (John D. McCalpin) write:
> 
> On Thu, 28 Oct 1993, Bo Thide' wrote:
> 
> > I have had a closer look at the code and see that one must either
> > increase n or use another timing technique.  I now see that the ntimes
> > parameter does not affect the clock granularity.
> 
> It does not affect the timing directly, but it does allow averaging
> over more ticks.   I am not convinced that the answers are the same as for
> timing the larger segment separately, but it is the best that can be done
> if memory is limited.
> 

I just rebuilt my 720 kernel with a four times larger stack size and
increased the n parameter a bit.  Here are the new HP9000/720 results:

--------------------------------------
 Single precision appears to have  7 digits of accuracy
 Assuming 4 bytes per default REAL word
--------------------------------------
 Timing calibration ; time =  118.0 hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   44.4446       .3650       .3600       .3700  
Scaling   :   47.0589       .3400       .3400       .3400  
Summing   :   48.0001       .5020       .5000       .5100  
SAXPYing  :   52.1739       .4650       .4600       .4700  

--------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
 Timing calibration ; time =  114.0000076964497 hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   48.4849       .3330       .3300       .3400  
Scaling   :   50.0000       .3260       .3200       .3300  
Summing   :   50.0000       .4810       .4800       .4900  
SAXPYing  :   53.3335       .4560       .4500       .4600  

As I see it, the best way would be to implement a timing routine based on
gettimeofday.  On the HP9000/700 this timer has a 1 microsec resolution.
The etime timer has a 10 millisec resolution.

Bo

-- 
   ^   Bo Thide'---------------------------------------------Science Director-
  |I|        Swedish Institute of Space Physics, S-755 91 Uppsala, Sweden
  |R|    Phone: (+46) 18-303671.  Fax: (+46) 18-403100.  IP: 130.238.30.23
 /|F|\          INTERNET: bt@irfu.se      UUCP: ...!mcvax!sunic!irfu!bt  
 ~~U~~ ----------------------------------------------------------------sm5dfw-

From hahn@neurocog.lrdc.pitt.edu Thu Oct 28 08:02:01 EDT 1993
Article: 12912 of comp.benchmarks
Path: news.udel.edu!darwin.sura.net!math.ohio-state.edu!cs.utexas.edu!uunet!pitt.edu!neurocog.lrdc.pitt.edu!hahn
From: hahn@neurocog.lrdc.pitt.edu (Mark Hahn)
Newsgroups: comp.benchmarks
Subject: Re: IBM RS/6000 or HP Apollo 9000: which to buy?
Message-ID: <5316@blue.cis.pitt.edu>
Date: 28 Oct 93 02:21:05 GMT
References: <2a7utr$lf5@nh1.u-aizu.ac.jp> <GEOMAGIC.93Oct23222209@moe.seismo.do.usbr.gov> <5190@blue.cis.pitt.edu> <schumach.751684941@convex.convex.com>
Sender: news+@pitt.edu
Lines: 160
X-Newsreader: TIN [version 1.2 PL2]
Status: RO

appended to this message is a fairly portable C translation of stream.f.
on our hp735 with "cc +P +O3 -J +Om1 -Wl,-a,archive", I get these results:

Timing calibration ; time = 760.00 usec.
Increase the size of the arrays if this is < 300
and your clock precision is =< 1/100 second.
---------------------------------------------------
Function Rate   (MB/s)     RMS time    Min time    Max time
Assignment:     69.837     247.083     240.000     260.000
Scaling   :     69.837     246.049     240.000     250.000
Summing   :     71.832     351.013     350.000     360.000
SAXPYing  :     73.945     350.143     340.000     370.000

The code is also available for anon ftp from 
neurocog.lrdc.pitt.edu:pub/cstream.c

/*
* Program: Stream
* Programmer: John D. McCalpin
* Revision: 2.0, September 30,1991
*
* This program measures memory transfer rates in MB/s for simple 
* computational kernels coded in Fortran.  These numbers reveal the
* quality of code generation for simple uncacheable kernels as well
* as showing the cost of floating-point operations relative to memory
* accesses.
*
* INSTRUCTIONS:
*	1) (fortran-specific, omitted.)
*	2) Stream requires a good bit of memory to run.
*	   Adjust the Parameter 'N' in the second line of the main
*	   program to give a 'timing calibration' of at least 20 clicks.
*	   This will provide rate estimates that should be good to 
*	   about 5% precision.
*	3) Compile the code with full optimization.  Many compilers
*	   generate unreasonably bad code before the optimizer tightens
*	   things up.  If the results are unreasonably good, on the
*	   other hand, the optimizer might be too smart for me!
*	4) Mail the results to mccalpin@perelandra.cms.udel.edu
*	   Be sure to include:
*		a) computer hardware model number and software revision
*		b) the compiler flags
*		c) all of the output from the test case.
*
* Thanks!
*
* This version was ported from the fortran by Mark Hahn, hahn+@pitt.edu.
*/

#define N (1023*1024)
#define NTIMES 10

#define _HPUX_SOURCE 1
#define _POSIX_SOURCE 1
#define _XOPEN_SOURCE 1
#define _INCLUDE_POSIX_SOURCE 1

#include <limits.h>
#include <float.h>	/* FLT_MAX */
#include <time.h>
#include <sys/times.h>
#include <math.h>
#include <stdio.h>

#ifndef MIN
#define MIN(x,y) ((x)<(y)?(x):(y))
#define MAX(x,y) ((x)>(y)?(x):(y))
#endif

struct tms tmsStart;

void mtimeStart() {
    times(&tmsStart);
}

/* Elapsed user+system CPU time since mtimeStart(), in milliseconds. */
float mtime() {
    struct tms t;
    times(&t);
    return 1e3 * (float) ((t.tms_stime - tmsStart.tms_stime) + 
	     (t.tms_utime - tmsStart.tms_utime)) / (float) CLK_TCK;
}

typedef double real;
static real a[N],b[N],c[N];

int main() {
    int j,k;
    float times[4][NTIMES];
    static float rmstime[4] = {0};
    static float mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX};
    static float maxtime[4] = {0};
    static char *label[4] = {"Assignment:",
			     "Scaling   :",
			     "Summing   :",
			     "SAXPYing  :"};
    static float bytes[4] = { 2 * sizeof(real) * N,
			      2 * sizeof(real) * N,
			      3 * sizeof(real) * N,
			      3 * sizeof(real) * N};

    /* --- SETUP --- determine precision and check timing --- */
    mtimeStart();
    for (j=0; j<N; j++) {
	a[j] = 1.0;
	b[j] = 2.0;
	c[j] = 0.0;
    }
    printf("Timing calibration ; time = %.2f usec.\n",mtime());
    printf("Increase the size of the arrays if this is < 300\n"
	   "and your clock precision is =< 1/100 second.\n");
    printf("---------------------------------------------------\n");
    
    /*	--- MAIN LOOP --- repeat test cases NTIMES times --- */
    for (k=0; k<NTIMES; k++) {
	mtimeStart();
	for (j=0; j<N; j++)
	    c[j] = a[j];
	times[0][k] = mtime();
	
	mtimeStart();
	for (j=0; j<N; j++)
	    c[j] = 3.0e0*a[j];
	times[1][k] = mtime();
	
	mtimeStart();
	for (j=0; j<N; j++)
	    c[j] = a[j]+b[j];
	times[2][k] = mtime();
	
	mtimeStart();
	for (j=0; j<N; j++)
	    c[j] = a[j]+3.0e0*b[j];
	times[3][k] = mtime();
    }
    
    /*	--- SUMMARY --- */
    for (k=0; k<NTIMES; k++) {
	for (j=0; j<4; j++) {
	    rmstime[j] = rmstime[j] + (times[j][k] * times[j][k]);
	    mintime[j] = MIN(mintime[j], times[j][k]);
	    maxtime[j] = MAX(maxtime[j], times[j][k]);
	}
    }
    
    printf("Function Rate   (MB/s)     RMS time    Min time    Max time\n");
    for (j=0; j<4; j++) {
	rmstime[j] = sqrt(rmstime[j]/(float)NTIMES);

	printf("%s%11.3f %11.3f %11.3f %11.3f\n",
	       label[j],
	       bytes[j]/mintime[j]/1e3,
	       rmstime[j],
	       mintime[j],
	       maxtime[j]);
    }
    return 0;
}

regards, mark hahn.
--
this space intentionally left non-blank.	hahn@neurocog.lrdc.pitt.edu


From r02kar@rec06.desy.de Thu Nov  4 12:38:08 1993
Received: from rec06.desy.de by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
	for mccalpin id AA24576; Thu, 4 Nov 93 12:35:43 -0500
Return-Path: <r02kar@rec06.desy.de>
Received: from localhost.desy.de by rec06.desy.de via SMTP (920330.SGI/920502.SGI.AUTO)
	for mccalpin@perelandra.cms.udel.edu id AA10232; Thu, 4 Nov 93 18:34:39 +0100
Message-Id: <9311041734.AA10232@rec06.desy.de>
To: mccalpin
Subject: stream result
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Thu, 04 Nov 1993 18:34:38 +0100
From: Karsten Kuenne <r02kar@rec06.desy.de>
Status: RO
X-Status: 

Hi John,

this is what I get on an SGI Challenge (1 cpu).

zarah2:~/benchmarks-> uname -a                                           18:25 
IRIX zarah2 5.1.1.1 10011612 IP19 mips

zarah2:~/benchmarks-> hinv                                               18:25 
18 150 MHZ IP19 Processors
CPU: MIPS R4400 Processor Chip Revision: 5.0
FPU: MIPS R4010 Floating Point Chip Revision: 0.0
Data cache size: 16 Kbytes
Instruction cache size: 16 Kbytes
Secondary unified instruction/data cache size: 1 Mbyte
Main memory size: 512 Mbytes, 4-way interleaved
I/O board, Ebus slot 15: IO4 revision 1
Integral IO4 serial ports: 4
Integral Ethernet controller: et0, Ebus slot 15
Integral SCSI controller 1: Version WD33C95A
Disk drive: unit 8 on SCSI controller 1
Disk drive: unit 4 on SCSI controller 1
Disk drive: unit 3 on SCSI controller 1
Disk drive: unit 2 on SCSI controller 1
Disk drive: unit 1 on SCSI controller 1
Integral SCSI controller 0: Version WD33C95A
Disk drive: unit 5 on SCSI controller 0
Disk drive: unit 4 on SCSI controller 0
Disk drive: unit 1 on SCSI controller 0
Integral Ethernet: ec0, version 1
Integral IO4 parallel port: Ebus slot 15
VME bus: adapter 0 mapped to adapter 61
VME bus: adapter 61


Compiler options:

f77 -ddopt -mips2 -O4 -non_shared -o stream stream.f


zarah2:~/benchmarks-> ./stream                                           18:27 
--------------------------------------
 Single precision appears to have  7 digits of accuracy
 Assuming 4 bytes per default REAL word
--------------------------------------
 Timing calibration ; time =    42.00000     hundredths of a second
 Increase the size of the arrays if this is <30 
  and your clock precision is =<1/100 second
 ---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   66.6667      0.1271      0.1200      0.1300
Scaling   :   66.6667      0.1291      0.1200      0.1400
Summing   :   66.6667      0.1921      0.1800      0.2000
SAXPYing  :   63.1579      0.1941      0.1900      0.2000


And this is for cstream:


Compiler options:

cc -sopt -O4 -mips2 -non_shared -o cstream cstream.c -lfastm


zarah2:~/benchmarks-> ./cstream                                          18:31 
Timing calibration ; time = 980.00 usec.
Increase the size of the arrays if this is < 300
and your clock precision is =< 1/100 second.
---------------------------------------------------
Function Rate   (MB/s)     RMS time    Min time    Max time
Assignment:     69.837     245.051     240.000     250.000
Scaling   :     69.837     250.080     240.000     260.000
Summing   :     69.837     368.076     360.000     380.000
SAXPYing  :     71.832     361.040     350.000     370.000


Best regards,
Karsten Kuenne.
--
////////////////////////////////////////////////////////////////////
Karsten Kuenne, DESY (-R2-), Notkestr. 85, 22607 Hamburg, Germany
phone: +49-40-8998-3315      fax: +49-40-8994-4429
e-mail: <kuenne@desy.de>, <r02kar@rec06.desy.de>,
        <r02kar@dhhdesy3.bitnet>