From turner@csrd.uiuc.edu Sat Jan 23 16:17:25 1993
Received: from s46.csrd.uiuc.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
    for mccalpin id AA28124; Sat, 23 Jan 93 16:17:25 -0500
Received: from a7.csrd.uiuc.edu by s46.csrd.uiuc.edu with SMTP id AA10384
    (5.67a/IDA-1.5 for ); Sat, 23 Jan 1993 15:18:36 -0600
Received: by a7.csrd.uiuc.edu (4.12/9.2)
    id AA00934; Sat, 23 Jan 93 15:13:35 cst
Date: Sat, 23 Jan 93 15:13:35 cst
From: turner@csrd.uiuc.edu (Steve Turner)
Message-Id: <9301232113.AA00934@a7.csrd.uiuc.edu>
To: mccalpin
Subject: Stream results
Status: RO

Your posting on comp.sys.super piqued my curiosity, so I snarfed the
benchmark and ported it to our machines. I work for CSRD, and we have a
bunch of Alliant machines here, as well as our own home-brewed
agglomeration of 4 FX/80s called Cedar. You probably have heard of us,
so I'll just tell you what I did to get the results and then give
results.

I made two changes to the source code. First, I replaced the calls to
"second" with calls to the High Resolution Clock timer facility (hrcget
and hrcdelta). This is a microsecond-resolution timer used for
performance evaluation, so I think the results should be accurate.
Second, I made slight changes to the result FORMAT statements, since
Alliant's fortran compiler assumes carriage control info is used. I will
send you a copy of the altered source, if you want, but since the
changes were so trivial it doesn't seem necessary.

The results for an FX/80 with 8 processors (~11.75 MHz clock rate)
compiled with Alliant's fortran compiler using only the "-O" option:

--------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
Timing calibration ; time = 630.571000000000 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:    72.8155     0.0692     0.0659    0.0739
Scaling   :    71.5990     0.0700     0.0670    0.0746
Summing   :    76.2793     0.0971     0.0944    0.1028
SAXPYing  :    76.5143     0.0998     0.0941    0.1119
----------------

The results for an FX/2800 using a 14 processor "cluster", compiled with
Alliant's fortran compiler using just the "-O" option are:
(sorry, I don't know the clock rate of the i860's)

--------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
Timing calibration ; time = 15.7310000000000 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   144.6655     0.0342     0.0332    0.0362
Scaling   :   150.1877     0.0328     0.0320    0.0347
Summing   :   135.1859     0.0549     0.0533    0.0584
SAXPYing  :   125.3264     0.0590     0.0575    0.0618
----------------

Since this seemed to run too fast, I bumped up the array size by one
order of magnitude and ran it again.
The "long stream" results are:

--------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
Timing calibration ; time = 185.139000000000 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   309.0394     0.1650     0.1553    0.1882
Scaling   :   305.9273     0.1658     0.1569    0.1834
Summing   :   297.8160     0.2528     0.2418    0.2889
SAXPYing  :   291.9708     0.2620     0.2466    0.3141
----------------

I plan on porting it to Cedar, too, but this will require modification
of the array declarations in order to distribute the arrays to the
global memory. I'll send details along with the results once I get
them.

st

From lfm@pgroup.com Sat Jan 23 22:08:10 1993
Received: from libby.pgroup.com by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
    for mccalpin id AA28286; Sat, 23 Jan 93 22:08:10 -0500
Received: by libby.pgroup.com id AA14392
    (5.65c/IDA-1.4.4 for mccalpin@perelandra.cms.udel.edu); Sat, 23 Jan 1993 19:09:16 -0800
Date: Sat, 23 Jan 1993 19:09:16 -0800
From: Larry Meadows
Message-Id: <199301240309.AA14392@libby.pgroup.com>
To: mccalpin
Subject: stream results
Status: RO

This is for a 40 mhz i860-XR workstation. I know of an i860-XP based PC
card that gets 400 MB/Sec. So that makes it faster than any other
workstation on your list, and most of the superminis. Now if it just had
a superscalar FP unit...

Maybe someone else will run it on the paragon. Intel wouldn't like it if
I did.

I always knew that the HP systems only got their performance when
things fit in cache.

--------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
Timing calibration ; time = 93.00000000000000 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   160.0000     0.0333     0.0300    0.0400
Scaling   :   160.0000     0.0363     0.0300    0.0400
Summing   :   120.0000     0.0632     0.0600    0.0700
SAXPYing  :   120.0000     0.0652     0.0600    0.0700

From lfm@pgroup.com Mon Jan 25 13:55:24 1993
Received: from libby.pgroup.com by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
    for mccalpin id AA01609; Mon, 25 Jan 93 13:55:24 -0500
Received: by libby.pgroup.com id AA19798
    (5.65c/IDA-1.4.4 for mccalpin@perelandra.cms.udel.edu); Mon, 25 Jan 1993 10:56:39 -0800
Date: Mon, 25 Jan 1993 10:56:39 -0800
From: Larry Meadows
Message-Id: <199301251856.AA19798@libby.pgroup.com>
To: mccalpin
Subject: Re: stream results
Status: RO

>>This is for a 40 mhz i860-XR workstation.
>
>Who makes this?  What is its full model name?

Stardent Vistra, manufactured by Oki Electric, equivalent to an Oki 7300
Model 20. 40 MHZ, 8KB data cache, 4KB instruction cache, 32MB of memory,
running Unix System V Release 4, using The Portland Group's pgf77
version 2.1. Now sold by Kubota Computer.

>>I know of an i860-XP based PC
>>card that gets 400 MB/Sec. So that makes it faster than any other
>>workstation on your list, and most of the superminis. Now if it just had
>>a superscalar FP unit...
>
>Is that *really* 400 MB/s from compiled high-level code?
>Is it one of those expensive versions with all SRAM instead of DRAM?

400 MB/Sec from compiled code. Note, however, that the code generation
technique we use for compilation ends up calling an assembly coded
routine to pull the data into cache, so it is running flat out.
Nevertheless, the technique is generally applicable; it is equivalent to
using cache as vector registers, and the assembly routines (called
streamin/streamout routines) are equivalent to hardware vload/vstore
instructions.

They use static column DRAM (like the oki station above). Don't know the
price. The specific board I mention is made by Transtech (try Richard
Stevens -- rs@transt.co.uk if you want further information).

lfm

From mnp@Texaco.COM Mon Apr 19 18:35:36 1993
Received: from Texaco.TEXACO.COM by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
    for mccalpin id AA01682; Mon, 19 Apr 93 18:35:36 -0400
Received: by Texaco.COM (4.1/SMI-4.1)
    id AA11387; Mon, 19 Apr 93 17:41:10 CDT
Date: Mon, 19 Apr 93 17:41:10 CDT
From: mnp@Texaco.COM (Mark N. Portney)
Message-Id: <9304192241.AA11387@Texaco.COM>
To: mccalpin
Status: RO

John,

Thanks for the clear description of c = a + b.

I do not understand how to predict the SGI performance. (It doesn't
matter other than for my curiosity!) On the Crimson, cache miss
penalties (both) were quoted as 110 internal cycles with no write back,
and 119 with write back. These seem consistent with some measurements.
Apparently these also seem to apply to the Challenge (too bad). The
latency is so large, it wouldn't matter if the bandwidth were
infinite! :-)

The current Challenge is 100 MHz internal, 50 MHz external, 47.6 MHz
bus. Cache lines are 16 bytes primary, 128 bytes secondary, and the bus
(256 bits wide) can deliver data on 4 out of 5 cycles on one
transaction, for (256/8 bytes)*(47.6 MHz)*(4/5) = 1.218 GB/s. (The
secondary cache is 1MB). Next clock goes to 150 MHz, 75 MHz, bus still
at 47.6 MHz.

I have asked SGI for an explanation of how the different parts of the
system affect the latency, and how it should change in the future. If I
hear from them, I'll send you the information.

I believe the latency for the IBM 580 is about 15.5 cycles from other
tests I have run, and a TLB miss penalty of 38 cycles.

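The bus arithmetic quoted above can be checked with a short Python
sketch. The function and variable names are illustrative and do not come
from the original message. The second estimate is only an assumption
made here to illustrate the remark about latency: it supposes that each
miss fetches one 128-byte secondary cache line, costs 110 cycles at
100 MHz, and that misses do not overlap.

```python
def peak_bus_bandwidth_mb_s(bus_width_bits, bus_clock_mhz, data_cycles, total_cycles):
    """Peak bus bandwidth in MB/s (1 MB = 1e6 bytes)."""
    bytes_per_cycle = bus_width_bits / 8
    return bytes_per_cycle * bus_clock_mhz * data_cycles / total_cycles


def latency_bound_mb_s(line_bytes, miss_cycles, cpu_clock_mhz):
    """Bandwidth if every cache line costs one full, non-overlapped miss (assumption)."""
    miss_time_us = miss_cycles / cpu_clock_mhz  # cycles / (cycles per microsecond)
    return line_bytes / miss_time_us            # bytes per microsecond == MB/s


# Figures quoted in the message: 256-bit bus, 47.6 MHz, data on 4 of 5 cycles.
challenge_peak = peak_bus_bandwidth_mb_s(256, 47.6, 4, 5)

# Assumed model: 128-byte secondary line, 110-cycle miss, 100 MHz internal clock.
latency_bound = latency_bound_mb_s(128, 110, 100.0)

print(f"peak bus bandwidth : {challenge_peak:.2f} MB/s")
print(f"latency-bound rate : {latency_bound:.2f} MB/s")
```

Under these assumptions the latency-bound figure is roughly a tenth of
the bus peak, which is the sense in which the bus bandwidth hardly
matters for a single processor.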
I have run streams on several machines here (included below). The Challenge runs better when I compile on the Crimson (? - I'll track this down). If you would like me to run some small benchmarks for you on the 580 (or Challenge), let me know. Thanks again, Mark stream_d SGI Crimson R4000 os 4.0.5 ftn 3.10 f77 -O2 -mips2 -------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Timing calibration ; time = 91.99999682605267 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 61.5385 0.2700 0.2600 0.2800 Scaling : 59.2594 0.2750 0.2700 0.2800 Summing : 58.5367 0.4140 0.4100 0.4200 SAXPYing : 59.9999 0.4080 0.4000 0.4100 stream_d SGI Challenge R4400 os 5.0 ftn 3.10 f77 -O2 -mips2 compiled on Challenge -------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Timing calibration ; time = 122.9999981820583 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 48.4847 0.3410 0.3300 0.3500 Scaling : 47.0589 0.3451 0.3400 0.3600 Summing : 50.0000 0.4921 0.4800 0.5000 SAXPYing : 54.5456 0.4451 0.4400 0.4600 stream_d SGI Challenge R4400 os 5.0 ftn 3.10 f77 -O2 -mips2 compiled on Crimson -------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Timing calibration ; time = 117.0000027865171 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 
second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 57.1429 0.2891 0.2800 0.3000 Scaling : 55.1724 0.3001 0.2900 0.3100 Summing : 53.3334 0.4641 0.4500 0.4800 SAXPYing : 54.5456 0.4450 0.4400 0.4500 stream_s SGI Crimson R4000 os 4.0.5 ftn 3.10 f77 -O2 -mips2 -------------------------------------- Single precision appears to have 7 digits of accuracy Assuming 4 bytes per default REAL word -------------------------------------- Timing calibration ; time = 47.00000 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 57.1429 0.1471 0.1400 0.1500 Scaling : 53.3333 0.1571 0.1500 0.1700 Summing : 54.5455 0.2270 0.2200 0.2300 SAXPYing : 52.1739 0.2371 0.2300 0.2500 stream_s SGI Challenge R4400 os 5.0 ftn 3.10 f77 -O2 -mips2 compiled on Challenge -------------------------------------- Single precision appears to have 7 digits of accuracy Assuming 4 bytes per default REAL word -------------------------------------- Timing calibration ; time = 61.00000 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 40.0000 0.2061 0.2000 0.2100 Scaling : 36.3636 0.2281 0.2200 0.2400 Summing : 41.3793 0.2981 0.2900 0.3100 SAXPYing : 48.0000 0.2550 0.2500 0.2600 stream_s SGI Challenge R4400 os 5.0 ftn 3.10 f77 -O2 -mips2 compiled on Crimson -------------------------------------- Single precision appears to have 7 digits of accuracy Assuming 4 bytes per default REAL word -------------------------------------- Timing calibration ; time = 62.00000 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second 
--------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 53.3333 0.1591 0.1500 0.1700 Scaling : 47.0589 0.1741 0.1700 0.1900 Summing : 50.0000 0.2440 0.2400 0.2500 SAXPYing : 48.0000 0.2571 0.2500 0.2700 stream_d SUN SS2 os 4.1.2 f77 SC1.0 f77 -O2 -cg89 -------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Timing calibration ; time = 382.99998268485 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 20.5129 0.8168 0.7800 1.0100 Scaling : 18.6047 0.8640 0.8600 0.8900 Summing : 21.8182 1.1482 1.1000 1.5000 SAXPYing : 22.4299 1.1078 1.0700 1.3400 stream_d SUN SS10/41 os 4.1.3 f77 SC2.0.1 f77 -O2 -cg92 -------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Timing calibration ; time = 238.33334110677 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 34.2857 0.4868 0.4667 0.5000 Scaling : 38.4001 0.4267 0.4167 0.4333 Summing : 36.9231 0.6534 0.6500 0.6667 SAXPYing : 37.8947 0.6467 0.6333 0.6500 stream_s SUN SS2 os 4.1.2 f77 SC1.0 f77 -O2 -cg89 -------------------------------------- Single precision appears to have 7 digits of accuracy Assuming 4 bytes per default REAL word -------------------------------------- Timing calibration ; time = 191.000 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time 
Min time Max time Assignment: 18.6046 0.4411 0.4300 0.4500 Scaling : 17.7778 0.4640 0.4500 0.4700 Summing : 19.3548 0.6270 0.6200 0.6300 SAXPYing : 20.6897 0.6021 0.5800 0.6100 stream_s SUN SS10/41 os 4.1.3 f77 SC2.0.1 f77 -O2 -cg92 -------------------------------------- Single precision appears to have 7 digits of accuracy Assuming 4 bytes per default REAL word -------------------------------------- Timing calibration ; time = 131.667 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 36.9232 0.2302 0.2167 0.2500 Scaling : 34.2857 0.2418 0.2333 0.2500 Summing : 37.8947 0.3301 0.3167 0.3333 SAXPYing : 36.0000 0.3401 0.3333 0.3500 stream_s RS/6000-580 xlf 2.03 f77 -O Test #1 Failed = picalc=piexact Apparently Single=Double Precision Proceeding to Test #2 -------------------------------------- Single precision appears to have 7 digits of accuracy Assuming 4 bytes per default REAL word -------------------------------------- Timing calibration ; time = 67.00000000 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 177.7778 .1880 .1800 .1900 Scaling : 118.5185 .2841 .2700 .2900 Summing : 137.1429 .3551 .3500 .3800 SAXPYing : 141.1765 .3551 .3400 .3700 -------------------------------------- stream_d RS/6000-580 xlf 2.03 f77 -O Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Timing calibration ; time = 133.000000000000000 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 
266.6667 .2512 .2400 .2700 Scaling : 246.1538 .2661 .2600 .2800 Summing : 234.1463 .4261 .4100 .4500 SAXPYing : 228.5714 .4352 .4200 .4600 From alan@msc.edu Tue Apr 20 18:32:54 1993 Received: from noc.msc.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI) for
mccalpin id AA04401; Tue, 20 Apr 93 18:32:54 -0400 Received: from af.msc.edu by noc.msc.edu (5.65/MSC/v3.0.1(920324)) id AA18958; Tue, 20 Apr 93 17:36:07 -0500 Received: by af.msc.edu (5.57/MSC/v3.0(901107)) id AA01835; Tue, 20 Apr 93 17:36:06 -0500 Date: Tue, 20 Apr 93 17:36:06 -0500 From: alan@msc.edu Message-Id: <9304202236.AA01835@af.msc.edu> To: mccalpin Subject: Re: New Chips & Memory Bandwidth Newsgroups: comp.arch In-Reply-To: References: <1993Apr15.151349.9383@walter.cray.com> Organization: Minnesota Supercomputer Center, Inc. Cc: Status: RO In article you write: < To: mccalpin Subject: oops Status: RO The values I gave you for n were erroneous. For stream_s, n=256 million (not 128 mil). For stream_d, n=128 million (not 64 mil). Sorry. -- Alan E. Klietz Minnesota Supercomputer Center, Inc. 1200 Washington Avenue South Minneapolis, MN 55415 Tel: +1 612 337 3520 Internet: alan@msc.edu Fax: +1 612 337 3400 From jgm@doug.Econ.QueensU.CA Wed Apr 21 10:05:09 1993 Received: from doug.econ.QueensU.CA by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI) for mccalpin id AA05564; Wed, 21 Apr 93 10:05:09 -0400 Received: by doug.Econ.QueensU.CA (AIX 3.2/UCB 5.64/4.03) id AA03075; Wed, 21 Apr 1993 10:08:43 -0400 Date: Wed, 21 Apr 1993 10:03:06 +22300346 (EDT) From: "James G. MacKinnon" Subject: streams benchmark To: John McCalpin Message-Id: Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Status: RO I tried your streams benchmark on my new RS/6000 355, using xlf 2.3 with -O3 (when I used the preprocessors, they apparently optimized everything away). 
Here are the results: out.IBM_355_d (n=1000000) -------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Timing calibration ; time = 46.0000008344650269 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 133.3335 .1210 .1200 .1300 Scaling : 133.3335 .1281 .1200 .1300 Summing : 114.2860 .2100 .2100 .2100 SAXPYing : 120.0001 .2061 .2000 .2100 out.IBM_355_s (n=2000000) -------------------------------------- Single precision appears to have 7 digits of accuracy Assuming 4 bytes per default REAL word -------------------------------------- Timing calibration ; time = 56.00000000 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 100.0000 .1641 .1600 .1700 Scaling : 84.2105 .1951 .1900 .2000 Summing : 85.7144 .2830 .2800 .2900 SAXPYing : 85.7144 .2800 .2800 .2800 On this benchmark, my rather inexpensive 355 seems to compare very well with much more expensive machines from Sun and SGI. ********************************************************************** James G. 
MacKinnon                              Department of Economics
   phone: 613 545-2293                 Queen's University
     Fax: 613 545-6668                 Kingston, Ontario, Canada
Internet: jgm@doug.econ.queensu.ca     K7L 3N6

From jfc@Athena.MIT.EDU Wed Apr 21 10:24:07 1993
Received: from ACHATES.MIT.EDU by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
    for mccalpin id AA05619; Wed, 21 Apr 93 10:24:07 -0400
Received: by Achates.MIT.EDU (5.61)
    id AA25108; Wed, 21 Apr 93 10:27:24 -0400
Message-Id: <9304211427.AA25108@Achates.MIT.EDU>
To: mccalpin
Subject: stream program result
Date: Wed, 21 Apr 1993 10:27:23 EDT
From: John Carr
Status: RO

On a VAX 9000/420 (62.5 MHz clock) running Ultrix 4.2, compiled with
fort -O -V vector, parameter N changed to 1200000 for more accurate
results. The test only used 1 processor.

-------------------------------------
Double precision appears to have 17 digits of accuracy
Assuming 8 bytes per DOUBLEPRECISION word
-------------------------------------
Timing calibration ; time = 239.9999871850014 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:   144.0001     0.1486     0.1333    0.1667
Scaling   :   164.5719     0.1391     0.1167    0.1667
Summing   :   172.7997     0.1972     0.1667    0.2167
SAXPYing  :   157.0913     0.1918     0.1833    0.2000

-------------------------------------
Single precision appears to have 7 digits of accuracy
Assuming 4 bytes per default REAL word
-------------------------------------
Timing calibration ; time = 126.6667 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Assignment:    82.2860     0.1269     0.1167    0.1333
Scaling   :    82.2857     0.1335     0.1167    0.1500
Summing   :    78.5455     0.2020     0.1833    0.2333
SAXPYing  :    78.5455     0.2036     0.1833    0.2167

From jgm@doug.Econ.QueensU.CA Wed Apr 21 11:14:13 1993
Received: from doug.econ.QueensU.CA by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
    for mccalpin id AA05742; Wed, 21 Apr 93 11:14:13 -0400
Received: by doug.Econ.QueensU.CA (AIX 3.2/UCB 5.64/4.03)
    id AA11865; Wed, 21 Apr 1993 11:18:12 -0400
Date: Wed, 21 Apr 1993 11:13:46 +22300346 (EDT)
From: "James G. MacKinnon"
Subject: Re: streams benchmark
To: "John D. McCalpin"
In-Reply-To: <9304211412.AA05575@perelandra.cms.udel.edu>
Message-Id:
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: RO

I believe the 355 runs at the strange speed that was introduced with the
550 and also used by the 350. Some of my literature from IBM calls it
41 MHz and some calls it 42. I think it is actually something like 41.7.
The main differences between a 355 and a 350 are the larger instruction
cache for the former (32K vs 8K) and the greater memory and slots of the
latter (the 355 only has one memory slot, for 128MB maximum, and two
Micro Channel slots, one of which is filled by a GT3i graphics adapter).
The 365 and 375 are like the 355, but run at 50 and 62.5 MHz,
respectively.

**********************************************************************
James G.
MacKinnon Department of Economics phone: 613 545-2293
Queen's University Fax: 613 545-6668
Kingston, Ontario, Canada Internet: jgm@doug.econ.queensu.ca
K7L 3N6

From alan@msc.edu Wed Apr 21 15:25:47 1993
Received: from noc.msc.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
        for mccalpin id AA07169; Wed, 21 Apr 93 15:25:47 -0400
Received: from af.msc.edu by noc.msc.edu (5.65/MSC/v3.0.1(920324))
        id AA29421; Wed, 21 Apr 93 13:58:29 -0500
Received: by af.msc.edu (5.57/MSC/v3.0(901107))
        id AA02238; Wed, 21 Apr 93 13:58:29 -0500
Date: Wed, 21 Apr 93 13:58:29 -0500
From: alan@msc.edu
Message-Id: <9304211858.AA02238@af.msc.edu>
To: mccalpin
Subject: Re: New Chips & Memory Bandwidth
Status: RO

>>Assignment:**********     0.0264     0.0167     0.0333
>>Scaling   :**********     0.0279     0.0167     0.0333
>>Summing   :**********     0.0441     0.0333     0.0500
>>SAXPYing  :**********     0.0425     0.0333     0.0500
>
>128 MB/(0.0264 s) = 4848 MB/s = 19 MB/s/cpu = 1/25 of peak
>Not so good for a first cut?

n=128 million (elements), but each element is 8 bytes wide, so it should
be 19*8 = 152 MB/s/cpu.

>Of course, it is impossible to say anything very intelligent given
>these numbers.  Hopefully the TMC folks can manage something a bit
>more concrete.

Agreed.

From keith@earth.ox.ac.uk Thu Apr 22 10:57:05 1993
Received: from eeyore.earth.ox.ac.uk by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
        for mccalpin id AA09931; Thu, 22 Apr 93 10:57:05 -0400
From: Keith Refson
Received: from rahman.earth.earth.ox.ac.uk (rahman.earth.ox.ac.uk) by earth.ox.ac.uk;
        Thu, 22 Apr 93 16:00:23 BST
Date: Thu, 22 Apr 93 16:00:23 BST
Message-Id: <18135.9304221500@rahman.earth.earth.ox.ac.uk>
To: mccalpin
Subject: Memory bandwidth tests
Status: RO

Dear Prof. McCalpin,

I am a little puzzled by the results of your "bandwidth" benchmark, and
I wonder if you could explain them to me.  In a nutshell, I don't see
how you get the memory bandwidth figures from the results of running
your program.
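For reference, the rate column follows from the formula in the STREAM source that appears later in this archive: bytes moved (array length N, times 2 or 3 words touched per element, times the word size) divided by the minimum time over the repetitions. A minimal Python sketch of that formula is below. The function name `stream_rate_mb_s` is hypothetical, and N = 300000 is an assumption inferred from the HP figures further down in this message, not a value stated in it.

```python
# Words moved per array element for each STREAM kernel, mirroring the
# DATA bytes/2,2,3,3/ statement in the Fortran source in this archive.
WORDS_PER_ELEMENT = {
    "Assignment": 2,  # c = a        : 1 load + 1 store
    "Scaling": 2,     # c = 3*a      : 1 load + 1 store
    "Summing": 3,     # c = a + b    : 2 loads + 1 store
    "SAXPYing": 3,    # c = a + 3*b  : 2 loads + 1 store
}


def stream_rate_mb_s(kernel, n, bytes_per_word, min_time_s):
    """Rate in MB/s (1 MB = 1e6 bytes): bytes moved / minimum time."""
    return n * WORDS_PER_ELEMENT[kernel] * bytes_per_word / min_time_s / 1.0e6


# ASSUMPTION: N = 300000 is inferred from the reported rates, not stated.
# Word size is 8 bytes; minimum times are taken from the HP 755/735 table.
N = 300000
for kernel, min_time in [("Assignment", 0.07), ("Summing", 0.10), ("SAXPYing", 0.09)]:
    print(f"{kernel:10s} {stream_rate_mb_s(kernel, N, 8, min_time):8.2f} MB/s")
```

Under that assumed N, the three values come out at 68.57, 72.00 and 80.00 MB/s, which agree with the reported HP rates to within rounding.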
I have some money to invest in a machine to do serious numerical
calculations which are pretty memory-intensive and so I am keenly
interested in the results.  I would also be very interested in your
opinions of the various competitors.  I am looking at
IBM/HP/DEC Alpha/SGI Challenge machines.

You may be interested in my benchmarking results for these and other
machines.  The program is an MD simulation code written by me in C.  The
comparison of the "small", i.e. in-cache, runs with the "big" 30MB ones
is *very* revealing and does sort out the sheep from the goats.  If so,
just get the file "/pub/benchmark.tex" by anonymous ftp from
earth.ox.ac.uk (or eeyore.earth.ox.ac.uk or 163.1.22.1 -- we changed our
DNS records recently and it may not have propagated yet).

In any case I have some results from running your benchmark for
DEC Alpha/HP 755/735 and Stardent Titan P3.  My Titan results differ
substantially from those you have reported -- I don't know why.  But my
results agree well with the theoretical bus bandwidth of 256 MB/sec.

Sincerely,
Keith Refson

--------------------------------------------------
Here are the results.
HP 755/735

feynman 22: ./a.out
--------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
Timing calibration ; time = 21.00000046193599 hundredths of a second
Increase the size of the arrays if this is <30
and your clock precision is =<1/100 second
---------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:      68.5715      .0700      .0700      .0700
Scaling   :      68.5715      .0700      .0700      .0700
Summing   :      72.0001      .1021      .1000      .1100
SAXPYing  :      80.0001      .0971      .0900      .1000

DEC 3000/500 (150MHz AXP Alpha)

axpbb% ./a.out
--------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
Timing calibration ; time = 16.49440079927444 hundredths of a second
Increase the size of the arrays if this is <30
and your clock precision is =<1/100 second
---------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:     100.3680     0.0485     0.0478     0.0488
Scaling   :      96.4323     0.0499     0.0498     0.0508
Summing   :      98.3607     0.0736     0.0732     0.0742
SAXPYing  :      99.6900     0.0730     0.0722     0.0732

Kubota (ex Stardent Titan P3 ) 1processor

zebedee% a.out
--------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
Timing calibration ; time = 37.00000196695328 hundredths of a second
Increase the size of the arrays if this is <30
and your clock precision is =<1/100 second
---------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:     240.0002     0.0265     0.0200     0.0300
Scaling   :     240.0002     0.0283     0.0200     0.0300
Summing   :     239.9993     0.0391     0.0300     0.0400
SAXPYing  :     144.0001     0.0522     0.0500     0.0600

Kubota (ex Stardent Titan P3 ) 4 processors

zebedee% a.out
--------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
Timing calibration ; time = 79.99998927116394 hundredths of a second
Increase the size of the arrays if this is <30
and your clock precision is =<1/100 second
---------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:     480.0005     0.0228     0.0100     0.0300
Scaling   :     240.0002     0.0349     0.0200     0.0900
Summing   :     240.0002     0.0363     0.0300     0.0400
SAXPYing  :     240.0002     0.0449     0.0300     0.0600

From mash@mash.wpd.sgi.com Fri Apr 23 11:45:34 1993
Received: from SGI.COM by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
        for mccalpin id AA14952; Fri, 23 Apr 93 11:45:34 -0400
Received: from [192.26.61.16] by sgi.sgi.com via SMTP (920330.SGI/910110.SGI)
        for mccalpin@perelandra.cms.udel.edu id AA28899; Fri, 23 Apr 93 08:48:56 -0700
Received: by mash.wpd.sgi.com (920330.SGI/911001.SGI)
        for @sgi.sgi.com:mccalpin@perelandra.cms.udel.edu id AA16415; Fri, 23 Apr 93 08:48:55 -0700
From: mash@mash.wpd.sgi.com (John R. Mashey)
Message-Id: <9304230848.ZM16413@mash.wpd.sgi.com>
Date: Fri, 23 Apr 1993 08:48:55 -0700
In-Reply-To: "John D. McCalpin" "Re: CPU Speed vs. Memory Bandwidth" (Apr 22, 10:39pm)
References: <8735@fury.BOEING.COM> <9304230239.AA12899@perelandra.cms.udel.edu>
X-Mailer: Z-Mail (2.1.0 10/1/92)
To: "John D. McCalpin"
Subject: Re: CPU Speed vs. Memory Bandwidth
Status: RO

On Apr 22, 10:39pm, "John D. McCalpin" wrote:
> Subject: Re: CPU Speed vs. Memory Bandwidth
> In article you write:
> >For example, each memory board in a Challenge supports two-way interleaving:
> >each leaf can start a new read request every 200ns, but the system bus can deliver request 2X faster.
For example, if you had an infinitely fast CPU
Key phrase: -------------------------------------------^^^^^^^^^^^^^^^^^^^
(Current R4400 CPUs are not infinitely fast :-) The wording was carefully chosen!

> latencies) gives the observed 60 MB/s limit.
>
> Is it supposed to be different?

Sounds about right.  As you know:
1) current R4400s have 2-level caches, and relatively little overlap.
2) 150 MHz ones are faster, but not different.
3) TFPs have 1-level caches (as far as FP goes) and more overlap.
4) and T5s will have much more overlap, more prefetching, etc.

From lamaster@george.arc.nasa.gov Fri Apr 23 11:47:40 1993
Received: from george.arc.nasa.gov by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
        for mccalpin id AA14986; Fri, 23 Apr 93 11:47:40 -0400
Received: by george.arc.nasa.gov (4.1/1.35) id AA15325; Fri, 23 Apr 93 08:51:04 PDT
Date: Fri, 23 Apr 93 08:51:04 PDT
From: lamaster@george.arc.nasa.gov (Hugh LaMaster -- RCS)
Message-Id: <9304231551.AA15325@george.arc.nasa.gov>
To: mccalpin
Subject: Re: CPU Speed vs. Memory Bandwidth
Status: RO

>Except that one of the tests is simply a copy, which should not
>require any FPU intervention.

I guess I bungled my attempt to express what you just wrote.  Sorry
about that.  "Compute" was a poor choice of words intended to express
that we are looking at CPU load/store here, plus maybe FP operations,
but not the raw speed of the bus.  John Mashey's post is typical of what
you get from a vendor -- the guaranteed not to exceed speed of the bus.
However, in the case of the SGI, only a small fraction is available to
an application in any single CPU.  I admit that I am disappointed by the
SGI numbers.  There seems to be a bit of a cache bottleneck there.  The
DEC 3000/500 looks better: and, I am told, has better random access
speed than IBM.

I hope you don't mind my posting your results.  I couldn't resist the
opportunity to bring up my favorite subject while responding to someone
else.
>
>>Does anyone have numbers for the new DEC AXP systems, and new HP
>>7100-based systems?  Can someone from DEC or HP post results?
>
>I have received a number of new results lately -- check the archive.
>These include SGI Challenge, SGI Crimson, DEC 3000/500, HP 9000/755,
>and IBM RS/6000-580.  The results are very interesting....

I had fetched the previous week's table, which had only the SGI and IBM.
I just fetched the latest and posted the HP and DEC numbers as a
supplement (I hope before 20 other people do as well.)

Thanks!

Regards,
Hugh LaMaster

From mash@mash.wpd.sgi.com Fri Apr 23 18:06:14 1993
Received: from SGI.COM by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
        for mccalpin id AA17432; Fri, 23 Apr 93 18:06:14 -0400
Received: from [192.26.61.16] by sgi.sgi.com via SMTP (920330.SGI/910110.SGI)
        for mccalpin@perelandra.cms.udel.edu id AA28729; Fri, 23 Apr 93 15:09:39 -0700
Received: by mash.wpd.sgi.com (920330.SGI/911001.SGI)
        for @sgi.sgi.com:mccalpin@perelandra.cms.udel.edu id AA16850; Fri, 23 Apr 93 15:09:37 -0700
From: mash@mash.wpd.sgi.com (John R. Mashey)
Message-Id: <9304231509.ZM16848@mash.wpd.sgi.com>
Date: Fri, 23 Apr 1993 15:09:36 -0700
In-Reply-To: "John D. McCalpin" "Re: CPU Speed vs. Memory Bandwidth" (Apr 23, 11:53am)
References: <8735@fury.BOEING.COM> <9304230239.AA12899@perelandra.cms.udel.edu> <9304231553.AA14994@perelandra.cms.udel.edu>
X-Mailer: Z-Mail (2.1.0 10/1/92)
To: "John D. McCalpin"
Subject: Re: CPU Speed vs. Memory Bandwidth
Status: RO

On Apr 23, 11:53am, "John D. McCalpin" wrote:
> Subject: Re: CPU Speed vs. Memory Bandwidth
> > For example, if you had an infinitely fast CPU
> >Key phrase: -------------------------------------------^^^^^^^^^^^^^^^^^^^
> >(Current R4400 CPUs are not infinitely fast :-) The wording was carefully
> >chosen!
>
> The bottleneck here is definitely not in the CPU, but in the cache

Sorry: when I said CPU in this context, I meant the CPU subsystem up to
the memory interface (as opposed to the memory side of the interface).
This may be wrong thinking on my part, but I do it because there are so
many different partitionings of the CPU subsystem.  After all, for an
R4400, all of the cache control is part of the CPU chip itself...

> It is interesting to hear that the TFP machines will not have level 2
> cache for the FP operands.  The experience of Sun and SGI seems to be
> that 2-level caches are a risky proposition, performance-wise.

Well, yes, and no, depending on what people do.  For what *you* do, TFPs
will be a much better match.  For many things, I'm quite happy with
2-level caches.  (In fact, TFP has 2-level caches for the code and
integer data, 1 level for FP data, i.e., it goes like this:

    CPU + I-cache + D-cache (1 chip) ------- | scache FPU ------- |

that is, the FPU has direct access to multi-MB 4-set-associative cache,
and the whole CPU+FP complex can do 4 of:
    2 64-bit loads/stores, including index FP load/stores
    2 FP Mult-adds (or other FP ops)
    2 integer operations
per cycle

and quite obviously, this won't help integer performance much directly,
i.e., not a lot faster than 150 MHz R4400 ... but it will certainly help
your apps.  Also, the interface from system bus <-> cache is 128 bits
rather than the 64 of the R4400s.
From gklaass@nexus.yorku.ca Wed Apr 28 12:21:52 1993
Received: from nexus.yorku.ca by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
        for mccalpin id AA01745; Wed, 28 Apr 93 12:21:52 -0400
Received: by nexus.yorku.ca id <9223>; Wed, 28 Apr 1993 12:25:29 -0400
From: Gary Klaassen
To: mccalpin
Subject: Re: SPECint92, and other benchmarks (really LINPACK)
Newsgroups: comp.sys.sgi.hardware,comp.benchmarks
References: <1993Apr13.225551.26609@texhrc.uucp> <1rlmriINNq9q@usenet.pa.dec.com>
Message-Id: <93Apr28.122529edt.9223@nexus.yorku.ca>
Date: Wed, 28 Apr 1993 12:25:16 -0400
Status: RO

John:

>>In article gklaass@nexus.yorku.ca (Gary Klaassen) writes:
>>>Several others have pointed out that TFP Power Challenge will
>>>incorporate a streaming cache capable of supplying data to the
>>>floating point units at a rate of 21.6 GBytes per second.  This is
>>>presumably the rate between the cache and FPU, but what about the rate
>>>between memory and cache?

You replied:

>There are two levels of cache in the current machine (the "Challenge"),
>but I am told that the "Power Challenge" will bypass the first-level
>cache for FP operands.

This jibes with what John Mashey told me.  The Power Challenge Icache
will be 32KB, the Dcache is 16KB, and the secondary external cache will
have 2-16MB and a direct line to the FPU.

>I do not know the exact specs for the primary cache interface on the
>Challenge machines, except that the cache line is 16 bytes.  Assuming

There are 3 caches, 16KB Instruction, 16KB data and 1MB external.  The
field engineer told me they have 10ns RAM.

>a second-level cache hit has a latency of 3 cycles (chosen out of a
>nearby hat) and a 128-bit wide interface, then the transfer time is
>4 cycles, and the bandwidth is 300 MB/s.  Since the transfer time
>is much smaller than the latency, this throughput estimate is roughly
>a linear function of the latency.  Any better numbers from SGI?
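The 300 MB/s figure in the quoted paragraph is consistent with a simple model in which one 16-byte cache line arrives every 4 cycles. A minimal Python sketch of that arithmetic follows. It assumes the relevant clock is the 75 MHz CPU external clock mentioned later in this message, since the quoted text does not say which clock it uses, and the function name is hypothetical.

```python
def line_fill_bandwidth_mb_s(line_bytes, cycles_per_line, clock_hz):
    """Bandwidth in MB/s when one cache line arrives every `cycles_per_line` cycles."""
    return line_bytes * clock_hz / cycles_per_line / 1.0e6


# 16-byte line, 4 cycles per line; ASSUMED 75 MHz clock (not stated in the quote).
print(line_fill_bandwidth_mb_s(16, 4, 75e6))  # 300.0
```

Because the cycle count sits in the denominator, doubling the cycles per line halves the estimate, which is why the figure depends so strongly on the guessed latency.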
>The traffic between secondary cache and main memory goes over the main
>system bus with a peak throughput of 1.2 GB/s.  Unfortunately, a single cpu
>can only get about 1/20 of this performance via the cache interface.  The
>problem is that the latency is about 55-60 cycles (at 75 MHz).  The
>secondary cache lines are 128 bytes, and the bus width is 256 bits.  So the
>cache miss time is 60 cycles latency plus 4 cycles for the actual data
>transfer.  This gives a peak throughput of about 90 MB/s (note that the
>bus clock is about 48 MHz, while the cpu external clock is 75 MHz).

Yeah, tell me about it.  We bought one of these Challenge L's only to
discover its memory throughput is more or less the same as an Indigo.
This is of course because the memory subsystem is basically the same as
an Indigo (some differences for SMP).  For some reason they neglected to
mention this.  Moral: don't trust marketing numbers.

>Observed throughput from my "STREAM" benchmark shows about 60 MB/s on
>a single-cpu system.  Results are available by anonymous ftp to
>perelandra.cms.udel.edu in bench/stream/.

I am curious as to why the IBM 580 numbers are so much higher?  How much
of this is due to the memory bandwidth and how much is due to the
superscalar nature of the RS6000 cpu?  The benchmarks you chose are well
suited to RS6000 superscalar and would disadvantage some of the other
chips in the table.

>Since this throughput is dominated by latency, presumably faster cache
>controller hardware in future revisions will run faster.  The TFP chip
>will be able to cut the latency immediately by bypassing the first-
>level cache for FP operands.

I certainly hope so.

Regards,
Gary

Prof. G. P. Klaassen
Dept.
of Earth and Atmospheric Science
York University, North York, Ontario, Canada M3J 1P3
Email: gklaass@nexus.yorku.ca

From rsw@decade.maths.unsw.EDU.AU Thu Apr 29 19:35:36 1993
Received: from bach.udel.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
        for mccalpin id AA07375; Thu, 29 Apr 93 19:35:36 -0400
Received: from decade.maths.unsw.EDU.AU ([149.171.180.5]) by bach.udel.edu with SMTP (5.65c/IDA-1.2.8)
        id AA24874; Thu, 29 Apr 1993 19:39:16 -0400
Received: by decade.maths.unsw.EDU.AU (5.65/1.35) id AA05823; Fri, 30 Apr 1993 09:40:33 +1000
Date: Fri, 30 Apr 1993 09:40:33 +1000
From: rsw@decade.maths.unsw.EDU.AU
Message-Id: <9304292340.AA05823@decade.maths.unsw.EDU.AU>
To: mccalpin@bach.udel.edu
Subject: Stream on CM5
Status: RO

I have a modified version of your stream benchmark running on a CM5 and
results for a 16 PN partition.  Happy to let you have program and
results if someone else has not already provided.  In double precision
the SAXPY results are about 7GB/s (theoretical peak of 8 GB/s with 16
PNs).  Single precision much worse.

Dr. Rob Womersley E-mail: R.Womersley@unsw.edu.au
School of Mathematics rsw@hydra.maths.unsw.edu.au
University of New South Wales Phone: 61 - 2 - 697-2998
P.O. Box 1, Kensington NSW 2033 Fax: 61 - 2 - 662-6445
AUSTRALIA

From rsw@hydra.maths.unsw.EDU.AU Thu Apr 29 21:25:22 1993
Received: from hydra.maths.unsw.EDU.AU by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
        for mccalpin id AA07503; Thu, 29 Apr 93 21:25:22 -0400
Received: by hydra.maths.unsw.EDU.AU (5.61/1.35) id AA00588; Fri, 30 Apr 93 11:30:31 +1000
Date: Fri, 30 Apr 93 11:30:31 +1000
From: rsw@hydra.maths.unsw.EDU.AU
Message-Id: <9304300130.AA00588@hydra.maths.unsw.EDU.AU>
To: mccalpin
Subject: Re: Stream on CM5
Status: RO

*  Program: Stream
*  Programmer: John D. McCalpin
*  Revision: 2.0, September 30,1991
*
*  CM5 Data Parallel version by Rob Womersley 19 April, 1993
*
*  This program measures memory transfer rates in MB/s for simple
*  computational kernels coded in Fortran.  These numbers reveal the
*  quality of code generation for simple uncacheable kernels as well
*  as showing the cost of floating-point operations relative to memory
*  accesses.
*
*  INSTRUCTIONS:
*      1) Stream requires a cpu timing function called second().
*         A sample is shown below.  This is unfortunately rather
*         system dependent.  It helps to know the granularity of the
*         timing.  The code below assumes that the granularity is
*         1/100 seconds.
*      2) Stream requires a good bit of memory to run.
*         Adjust the Parameter 'N' in the second line of the main
*         program to give a 'timing calibration' of at least 20 clicks.
*         This will provide rate estimates that should be good to
*         about 5% precision.
*      3) Compile the code with full optimization.  Many compilers
*         generate unreasonably bad code before the optimizer tightens
*         things up.  If the results are unreasonably good, on the
*         other hand, the optimizer might be too smart for me!
*      4) Mail the results to mccalpin@perelandra.cms.udel.edu
*         Be sure to include:
*              a) computer hardware model number and software revision
*              b) the compiler flags
*              c) all of the output from the test case.
*
* Thanks!
*
      PROGRAM stream
C     .. Parameters ..
      INTEGER, PARAMETER :: n = 5000000, ntimes = 20
C     ..
C     .. Local Scalars ..
      DOUBLE PRECISION t, t0
      INTEGER j, k, nbpw, nvu
C     ..
C     .. Local Arrays ..
      DOUBLE PRECISION, ARRAY(n) :: a, b, c
CMF$  LAYOUT a(:news), b(:news), c(:news)
      DOUBLE PRECISION, ARRAY(4) :: maxtime,mintime,rmstime
      DOUBLE PRECISION, ARRAY(4, ntimes) :: times
      INTEGER bytes(4)
      CHARACTER label(4)*12
C     ..
C     .. External Functions ..
      INTEGER CMF_number_of_processors
      DOUBLE PRECISION CM_timer_read_cm_busy, CM_timer_read_cm_idle
      DOUBLE PRECISION CM_timer_read_elapsed
      EXTERNAL CMF_number_of_processors
      EXTERNAL CM_timer_read_cm_busy, CM_timer_read_cm_idle
      EXTERNAL CM_timer_read_elapsed
      INTEGER realsize
      EXTERNAL realsize
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC dble,max,min,sqrt
C     ..
C     .. Data statements ..
      DATA label/' Assignment:',' Scaling :',' Summing :',
     $     ' SAXPYing :'/
      DATA bytes/2,2,3,3/
C     ..

*       --- SETUP --- determine precision and check timing ---

      PRINT *,'STREAM: Measure memory transfer rates in MB/s'
      PRINT *,'for simple computational kernels in Fortran'
      PRINT *
      PRINT *,'CALL CMF_describe_array(a)'
      CALL CMF_describe_array(a)
      PRINT *
      nvu = CMF_number_of_processors()
      WRITE(*,'(/1x,A,I2,A,I2,A/)') 'CM5 with partition of ',nvu/4,
     $     ' processors ( ',nvu,' vector units )'

      nbpw = realsize()

      CALL CM_timer_clear(0)
      CALL CM_timer_start(0)
      a = 1.0D0
      b = 2.0D0
      c = 0.0D0
      CALL CM_timer_stop(0)
      t = CM_timer_read_elapsed(0)
      PRINT *
      PRINT *,'Vector length = ', n
      PRINT *,'Timing calibration: Time = ',t*100,' hundredths',
     $     ' of a second'
      PRINT *,'Increase the size of the arrays if this is < 30'
      PRINT *,'and your clock precision is =< 1/100 second'

*       --- MAIN LOOP --- repeat test cases NTIMES times ---

      DO 60 k = 1, ntimes

         CALL CM_timer_clear(1)
         CALL CM_timer_start(1)
         c = a
         CALL CM_timer_stop(1)
         t = CM_timer_read_elapsed(1)
         times(1,k) = t

         CALL CM_timer_clear(2)
         CALL CM_timer_start(2)
         c = 3.0D0 * a
         CALL CM_timer_stop(2)
         t = CM_timer_read_elapsed(2)
         times(2,k) = t

         CALL CM_timer_clear(3)
         CALL CM_timer_start(3)
         c = a + b
         CALL CM_timer_stop(3)
         t = CM_timer_read_elapsed(3)
         times(3,k) = t

         CALL CM_timer_clear(4)
         CALL CM_timer_start(4)
         c = a + 3.0D0 * b
         CALL CM_timer_stop(4)
         t = CM_timer_read_elapsed(4)
         times(4,k) = t

   60 CONTINUE

*       --- SUMMARY ---

      rmstime = SUM(times**2, DIM=2)
      rmstime = SQRT( rmstime/dble(ntimes) )
      mintime = MINVAL(times, DIM=2)
      maxtime = MAXVAL(times, DIM=2)

      WRITE (*,FMT=9000)
      DO 90 j = 1,4
         WRITE (*,FMT=9010) label(j),n*bytes(j)*nbpw/mintime(j)/1.0D6,
     $        rmstime(j),mintime(j),maxtime(j)
   90 CONTINUE

 9000 FORMAT (/1x, 57('-'),/,' Function :',1x,
     $     'Rate (MB/s) RMS time Min time Max time')
 9010 FORMAT (a,4(f10.4,2x))
      END

*-------------------------------------
* INTEGER FUNCTION dblesize()
*
* A semi-portable way to determine the precision of DOUBLEPRECISION
* in Fortran.
* Here used to guess how many bytes of storage a DOUBLEPRECISION
* number occupies.
*
      INTEGER FUNCTION realsize()
C     .. Local Scalars ..
      DOUBLE PRECISION result,test
      INTEGER j,ndigits
C     ..
C     .. Local Arrays ..
      DOUBLE PRECISION ref(30)
C     ..
C     .. External Subroutines ..
      EXTERNAL dummy
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC abs,acos,log10,sqrt
C     ..

C       Test #1 - compare single(1.0d0+delta) to 1.0d0

   10 DO 20 j = 1,30
         ref(j) = 1.0d0 + 10.0d0**(-j)
   20 CONTINUE

      DO 30 j = 1,30
         test = ref(j)
         ndigits = j
         CALL dummy(test,result)
         IF (test.EQ.1.0D0) THEN
            GO TO 50
         END IF
   30 CONTINUE
      GOTO 60

   50 WRITE (*,FMT='(a)') ' --------------------------------------'
      WRITE (*,FMT='(1x,a,i2,a)') 'Double precision appears to have ',
     $     ndigits,' digits of accuracy'
      IF (ndigits.LE.8) THEN
         realsize = 4
      ELSE
         realsize = 8
      END IF
      WRITE (*,FMT='(1x,a,i1,a)') 'Assuming ',realsize,
     $     ' bytes per DOUBLEPRECISION word'
      WRITE (*,FMT='(a)') ' --------------------------------------'
      RETURN

   60 PRINT *,' Hmmmm. I am unable to determine the size of a REAL'
      PRINT *,' Please enter the number of Bytes per DOUBLEPRECISION',
     $     ' number : '
      READ (*,FMT=*) realsize
      IF (realsize.NE.4 .AND. realsize.NE.8) THEN
         PRINT *,' Your answer ',realsize,' does not make sense!'
         PRINT *,' Try again!'
         PRINT *,' Please enter the number of Bytes per ',
     $        'REAL number : '
         READ (*,FMT=*) realsize
      END IF
      PRINT *,'You have manually entered a size of ',realsize,
     $     ' bytes per REAL number'
      WRITE (*,FMT='(a)') '--------------------------------------'
      END

      SUBROUTINE dummy(q,r)
C     .. Scalar Arguments ..
      DOUBLE PRECISION q,r
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC cos
C     ..
      r = cos(q)
      RETURN
      END

From rsw@hydra.maths.unsw.EDU.AU Thu Apr 29 21:29:24 1993
Received: from hydra.maths.unsw.EDU.AU by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
        for mccalpin id AA07508; Thu, 29 Apr 93 21:29:24 -0400
Received: by hydra.maths.unsw.EDU.AU (5.61/1.35) id AA01613; Fri, 30 Apr 93 11:34:33 +1000
Date: Fri, 30 Apr 93 11:34:33 +1000
From: rsw@hydra.maths.unsw.EDU.AU
Message-Id: <9304300134.AA01613@hydra.maths.unsw.EDU.AU>
To: mccalpin
Subject: CM5 stream_d.fcm
Status: RO

*  Program: Stream
*  Programmer: John D. McCalpin
*  Revision: 2.0, September 30,1991
*
*  CM5 Data Parallel version by Rob Womersley 19 April, 1993
*
*  This program measures memory transfer rates in MB/s for simple
*  computational kernels coded in Fortran.  These numbers reveal the
*  quality of code generation for simple uncacheable kernels as well
*  as showing the cost of floating-point operations relative to memory
*  accesses.
*
*  INSTRUCTIONS:
*      1) Stream requires a cpu timing function called second().
*         A sample is shown below.  This is unfortunately rather
*         system dependent.  It helps to know the granularity of the
*         timing.  The code below assumes that the granularity is
*         1/100 seconds.
*      2) Stream requires a good bit of memory to run.
*         Adjust the Parameter 'N' in the second line of the main
*         program to give a 'timing calibration' of at least 20 clicks.
*         This will provide rate estimates that should be good to
*         about 5% precision.
*      3) Compile the code with full optimization.  Many compilers
*         generate unreasonably bad code before the optimizer tightens
*         things up.  If the results are unreasonably good, on the
*         other hand, the optimizer might be too smart for me!
*      4) Mail the results to mccalpin@perelandra.cms.udel.edu
*         Be sure to include:
*              a) computer hardware model number and software revision
*              b) the compiler flags
*              c) all of the output from the test case.
*
* Thanks!
*
      PROGRAM stream
C     .. Parameters ..
      INTEGER, PARAMETER :: n = 5000000, ntimes = 20
C     ..
C     .. Local Scalars ..
      DOUBLE PRECISION t, t0
      INTEGER j, k, nbpw, nvu
C     ..
C     .. Local Arrays ..
      DOUBLE PRECISION, ARRAY(n) :: a, b, c
CMF$  LAYOUT a(:news), b(:news), c(:news)
      DOUBLE PRECISION, ARRAY(4) :: maxtime,mintime,rmstime
      DOUBLE PRECISION, ARRAY(4, ntimes) :: times
      INTEGER bytes(4)
      CHARACTER label(4)*12
C     ..
C     .. External Functions ..
      INTEGER CMF_number_of_processors
      DOUBLE PRECISION CM_timer_read_cm_busy, CM_timer_read_cm_idle
      DOUBLE PRECISION CM_timer_read_elapsed
      EXTERNAL CMF_number_of_processors
      EXTERNAL CM_timer_read_cm_busy, CM_timer_read_cm_idle
      EXTERNAL CM_timer_read_elapsed
      INTEGER realsize
      EXTERNAL realsize
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC dble,max,min,sqrt
C     ..
C     .. Data statements ..
      DATA label/' Assignment:',' Scaling :',' Summing :',
     $     ' SAXPYing :'/
      DATA bytes/2,2,3,3/
C     ..

*       --- SETUP --- determine precision and check timing ---

      PRINT *,'STREAM: Measure memory transfer rates in MB/s'
      PRINT *,'for simple computational kernels in Fortran'
      PRINT *
      PRINT *,'CALL CMF_describe_array(a)'
      CALL CMF_describe_array(a)
      PRINT *
      nvu = CMF_number_of_processors()
      WRITE(*,'(/1x,A,I2,A,I2,A/)') 'CM5 with partition of ',nvu/4,
     $     ' processors ( ',nvu,' vector units )'

      nbpw = realsize()

      CALL CM_timer_clear(0)
      CALL CM_timer_start(0)
      a = 1.0D0
      b = 2.0D0
      c = 0.0D0
      CALL CM_timer_stop(0)
      t = CM_timer_read_elapsed(0)
      PRINT *
      PRINT *,'Vector length = ', n
      PRINT *,'Timing calibration: Time = ',t*100,' hundredths',
     $     ' of a second'
      PRINT *,'Increase the size of the arrays if this is < 30'
      PRINT *,'and your clock precision is =< 1/100 second'

*       --- MAIN LOOP --- repeat test cases NTIMES times ---

      DO 60 k = 1, ntimes

         CALL CM_timer_clear(1)
         CALL CM_timer_start(1)
         c = a
         CALL CM_timer_stop(1)
         t = CM_timer_read_elapsed(1)
         times(1,k) = t

         CALL CM_timer_clear(2)
         CALL CM_timer_start(2)
         c = 3.0D0 * a
         CALL CM_timer_stop(2)
         t = CM_timer_read_elapsed(2)
         times(2,k) = t

         CALL CM_timer_clear(3)
         CALL CM_timer_start(3)
         c = a + b
         CALL CM_timer_stop(3)
         t = CM_timer_read_elapsed(3)
         times(3,k) = t

         CALL CM_timer_clear(4)
         CALL CM_timer_start(4)
         c = a + 3.0D0 * b
         CALL CM_timer_stop(4)
         t = CM_timer_read_elapsed(4)
         times(4,k) = t

   60 CONTINUE

*       --- SUMMARY ---

      rmstime = SUM(times**2, DIM=2)
      rmstime = SQRT( rmstime/dble(ntimes) )
      mintime = MINVAL(times, DIM=2)
      maxtime = MAXVAL(times, DIM=2)

      WRITE (*,FMT=9000)
      DO 90 j = 1,4
         WRITE (*,FMT=9010) label(j),n*bytes(j)*nbpw/mintime(j)/1.0D6,
     $        rmstime(j),mintime(j),maxtime(j)
   90 CONTINUE

 9000 FORMAT (/1x, 57('-'),/,' Function :',1x,
     $     'Rate (MB/s) RMS time Min time Max time')
 9010 FORMAT (a,4(f10.4,2x))
      END

*-------------------------------------
* INTEGER FUNCTION dblesize()
*
* A semi-portable way to determine the precision of DOUBLEPRECISION
* in Fortran.
* Here used to guess how many bytes of storage a DOUBLEPRECISION
* number occupies.
*
      INTEGER FUNCTION realsize()
C     .. Local Scalars ..
      DOUBLE PRECISION result,test
      INTEGER j,ndigits
C     ..
C     .. Local Arrays ..
      DOUBLE PRECISION ref(30)
C     ..
C     .. External Subroutines ..
      EXTERNAL dummy
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC abs,acos,log10,sqrt
C     ..

C       Test #1 - compare single(1.0d0+delta) to 1.0d0

   10 DO 20 j = 1,30
         ref(j) = 1.0d0 + 10.0d0**(-j)
   20 CONTINUE

      DO 30 j = 1,30
         test = ref(j)
         ndigits = j
         CALL dummy(test,result)
         IF (test.EQ.1.0D0) THEN
            GO TO 50
         END IF
   30 CONTINUE
      GOTO 60

   50 WRITE (*,FMT='(a)') ' --------------------------------------'
      WRITE (*,FMT='(1x,a,i2,a)') 'Double precision appears to have ',
     $     ndigits,' digits of accuracy'
      IF (ndigits.LE.8) THEN
         realsize = 4
      ELSE
         realsize = 8
      END IF
      WRITE (*,FMT='(1x,a,i1,a)') 'Assuming ',realsize,
     $     ' bytes per DOUBLEPRECISION word'
      WRITE (*,FMT='(a)') ' --------------------------------------'
      RETURN

   60 PRINT *,' Hmmmm. I am unable to determine the size of a REAL'
      PRINT *,' Please enter the number of Bytes per DOUBLEPRECISION',
     $     ' number : '
      READ (*,FMT=*) realsize
      IF (realsize.NE.4 .AND. realsize.NE.8) THEN
         PRINT *,' Your answer ',realsize,' does not make sense!'
         PRINT *,' Try again!'
         PRINT *,' Please enter the number of Bytes per ',
     $        'REAL number : '
         READ (*,FMT=*) realsize
      END IF
      PRINT *,'You have manually entered a size of ',realsize,
     $     ' bytes per REAL number'
      WRITE (*,FMT='(a)') '--------------------------------------'
      END

      SUBROUTINE dummy(q,r)
C     .. Scalar Arguments ..
      DOUBLE PRECISION q,r
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC cos
C     ..
      r = cos(q)
      RETURN
      END

From rsw@hydra.maths.unsw.EDU.AU Thu Apr 29 21:29:45 1993
Received: from hydra.maths.unsw.EDU.AU by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
        for mccalpin id AA07512; Thu, 29 Apr 93 21:29:45 -0400
Received: by hydra.maths.unsw.EDU.AU (5.61/1.35) id AA01728; Fri, 30 Apr 93 11:34:57 +1000
Date: Fri, 30 Apr 93 11:34:57 +1000
From: rsw@hydra.maths.unsw.EDU.AU
Message-Id: <9304300134.AA01728@hydra.maths.unsw.EDU.AU>
To: mccalpin
Subject: CM5 stream_s.fcm
Status: RO

*  Program: Stream
*  Programmer: John D. McCalpin
*  Revision: 2.0, September 30,1991
*
*  CM5 Data Parallel version by Rob Womersley 19 April, 1993
*
*  This program measures memory transfer rates in MB/s for simple
*  computational kernels coded in Fortran.  These numbers reveal the
*  quality of code generation for simple uncacheable kernels as well
*  as showing the cost of floating-point operations relative to memory
*  accesses.
*
*  INSTRUCTIONS:
*      1) Stream requires a cpu timing function called second().
*         A sample is shown below.  This is unfortunately rather
*         system dependent.  It helps to know the granularity of the
*         timing.  The code below assumes that the granularity is
*         1/100 seconds.
*      2) Stream requires a good bit of memory to run.
*         Adjust the Parameter 'N' in the second line of the main
*         program to give a 'timing calibration' of at least 20 clicks.
*         This will provide rate estimates that should be good to
*         about 5% precision.
*      3) Compile the code with full optimization.  Many compilers
*         generate unreasonably bad code before the optimizer tightens
*         things up.  If the results are unreasonably good, on the
*         other hand, the optimizer might be too smart for me!
*      4) Mail the results to mccalpin@perelandra.cms.udel.edu
*         Be sure to include:
*              a) computer hardware model number and software revision
*              b) the compiler flags
*              c) all of the output from the test case.
*
* Thanks!
*
      PROGRAM stream
C     .. Parameters ..
      INTEGER, PARAMETER :: n = 5000000, ntimes = 20
C     ..
C     .. Local Scalars ..
      REAL t, t0
      INTEGER j, k, nbpw, nvu
C     ..
C     .. Local Arrays ..
      REAL, ARRAY(n) :: a, b, c
CMF$  LAYOUT a(:news), b(:news), c(:news)
      REAL, ARRAY(4) :: maxtime,mintime,rmstime
      REAL, ARRAY(4, ntimes) :: times
      INTEGER bytes(4)
      CHARACTER label(4)*12
C     ..
C     .. External Functions ..
      INTEGER CMF_number_of_processors
      DOUBLE PRECISION CM_timer_read_cm_busy, CM_timer_read_cm_idle
      DOUBLE PRECISION CM_timer_read_elapsed
      EXTERNAL CMF_number_of_processors
      EXTERNAL CM_timer_read_cm_busy, CM_timer_read_cm_idle
      EXTERNAL CM_timer_read_elapsed
      INTEGER realsize
      EXTERNAL realsize
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC float,max,min,sqrt
C     ..
C     .. Data statements ..
      DATA label/' Assignment:',' Scaling :',' Summing :',
     $     ' SAXPYing :'/
      DATA bytes/2,2,3,3/
C     ..
*       --- SETUP --- determine precision and check timing ---

      PRINT *,'STREAM: Measure memory transfer rates in MB/s'
      PRINT *,'for simple computational kernels in Fortran'
      PRINT *
      PRINT *,'CALL CMF_describe_array(a)'
      CALL CMF_describe_array(a)

      nvu = CMF_number_of_processors()
      WRITE(*,'(/1x,A,I2,A,I2,A/)') 'CM5 with partition of ',nvu/4,
     $     ' processors ( ',nvu,' vector units )'

      nbpw = realsize()

      CALL CM_timer_clear(0)
      CALL CM_timer_start(0)
      a = 1.0
      b = 2.0
      c = 0.0
      CALL CM_timer_stop(0)
      t = CM_timer_read_cm_busy(0)
      PRINT *
      PRINT *,'Vector length = ', n
      PRINT *,'Timing calibration: Time = ',t*100,' hundredths',
     $     ' of a second'
      PRINT *,'Increase the size of the arrays if this is < 30'
      PRINT *,'and your clock precision is = < 1/100 second'

*       --- MAIN LOOP --- repeat test cases NTIMES times ---

      DO 60 k = 1, ntimes

          CALL CM_timer_clear(1)
          CALL CM_timer_start(1)
          c = a
          CALL CM_timer_stop(1)
          t = CM_timer_read_elapsed(1)
          times(1,k) = t

          CALL CM_timer_clear(2)
          CALL CM_timer_start(2)
          c = 3.0 * a
          CALL CM_timer_stop(2)
          t = CM_timer_read_elapsed(2)
          times(2,k) = t

          CALL CM_timer_clear(3)
          CALL CM_timer_start(3)
          c = a + b
          CALL CM_timer_stop(3)
          t = CM_timer_read_elapsed(3)
          times(3,k) = t

          CALL CM_timer_clear(4)
          CALL CM_timer_start(4)
          c = a + 3.0 * b
          CALL CM_timer_stop(4)
          t = CM_timer_read_elapsed(4)
          times(4,k) = t

   60 CONTINUE

*       --- SUMMARY ---

      rmstime = SUM(times**2, DIM=2)
      rmstime = SQRT( rmstime/float(ntimes) )
      mintime = MINVAL(times, DIM=2)
      maxtime = MAXVAL(times, DIM=2)

      WRITE (*,FMT=9000)
      DO 90 j = 1,4
          WRITE (*,FMT=9010) label(j),n*bytes(j)*nbpw/mintime(j)/1.0e6,
     $      rmstime(j),mintime(j),maxtime(j)
   90 CONTINUE

 9000 FORMAT (/1x,57('-'),/,' Function :',1x,
     $       'Rate (MB/s) RMS time Min time Max time')
 9010 FORMAT (a,4(f10.4,2x))
      END

*-------------------------------------
* INTEGER FUNCTION realsize()
*
* A semi-portable way to determine the precision of default REAL
* in Fortran.
* Here used to guess how many bytes of storage a real number occupies.
*
      INTEGER FUNCTION realsize()

C     Test #1 - compare double precision pi to acos(-1.0e0)

C     .. Local Scalars ..
      DOUBLE PRECISION pi
      REAL diff,picalc,result,test
      INTEGER j,ndigits
C     ..
C     .. Local Arrays ..
      DOUBLE PRECISION ref(30)
C     ..
C     .. External Subroutines ..
      EXTERNAL dummy
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC abs,acos,log10,sqrt
C     ..
      pi = 3.14159265358979323846264338327950288d0
      picalc = acos(-1.0e0)
      diff = abs(picalc-pi)
      IF (diff.EQ.0.0) THEN
          PRINT *,'Test #1 Failed = picalc=piexact'
          PRINT *,'Apparently Single=Double Precision'
          PRINT *,'Proceeding to Test #2'
          PRINT *,' '
          GO TO 10
      ELSE
          ndigits = -log10(abs(diff)) + 0.5
          GO TO 50
      END IF

C     Test #2 - compare single(1.0d0+delta) to 1.0e0

   10 DO 20 j = 1,30
          ref(j) = 1.0d0 + 10.0d0** (-j)
   20 CONTINUE

      DO 30 j = 1,30
          test = ref(j)
          ndigits = j
          CALL dummy(test,result)
          IF (test.EQ.1.0e0) THEN
              GO TO 50
          END IF
   30 CONTINUE
      PRINT *,'Test #2 failed - Precision appears to exceed 30 digits'
      PRINT *,'Proceeding to Test #3'
      GO TO 40

C     Test #3 - abs(sqrt(1.0d0)-sqrt(1.0e0))

   40 diff = abs(sqrt(1.0d0)-sqrt(1.0e0))
      IF (diff.EQ.0.0) THEN
          PRINT *,'Test Failed - sqrt(1.0e0)=sqrt(1.0d0)'
          PRINT *,'Apparently Single=Double Precision'
          PRINT *,'Giving up'
          GO TO 60
      ELSE
          ndigits = -log10(abs(diff)) + 0.5
          GO TO 50
      END IF

   50 WRITE (*,FMT='(a)') '--------------------------------------'
      WRITE (*,FMT='(1x,a,i2,a)') 'Single precision appears to have ',
     $  ndigits,' digits of accuracy'
      IF (ndigits.LE.8) THEN
          realsize = 4
      ELSE
          realsize = 8
      END IF
      WRITE (*,FMT='(1x,a,i1,a)') 'Assuming ',realsize,
     $  ' bytes per default REAL word'
      WRITE (*,FMT='(a)') '--------------------------------------'
      RETURN

   60 PRINT *,'Hmmmm. I am unable to determine the size of a REAL'
      PRINT *,'Please enter the number of Bytes per REAL number : '
      READ (*,FMT=*) realsize
      IF (realsize.NE.4 .AND. realsize.NE.8) THEN
          PRINT *,'Your answer ',realsize,' does not make sense!'
          PRINT *,'Try again!'
PRINT *,'Please enter the number of Bytes per ', $ 'REAL number : ' READ (*,FMT=*) realsize END IF PRINT *,'You have manually entered a size of ',realsize, $ ' bytes per REAL number' WRITE (*,FMT='(a)') '--------------------------------------' END SUBROUTINE dummy(q,r) C .. Scalar Arguments .. REAL q,r C .. C .. Intrinsic Functions .. INTRINSIC cos C .. r = cos(q) RETURN END From rsw@hydra.maths.unsw.EDU.AU Thu Apr 29 21:30:14 1993 Received: from hydra.maths.unsw.EDU.AU by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI) for mccalpin id AA07518; Thu, 29 Apr 93 21:30:14 -0400 Received: by hydra.maths.unsw.EDU.AU (5.61/1.35) id AA01867; Fri, 30 Apr 93 11:35:27 +1000 Date: Fri, 30 Apr 93 11:35:27 +1000 From: rsw@hydra.maths.unsw.EDU.AU Message-Id: <9304300135.AA01867@hydra.maths.unsw.EDU.AU> To: mccalpin Subject: CM5 Stream_d results Status: RO STREAM: Measure memory transfer rates in MB/s for simple computational kernels in Fortran SunOS Release 4.1.2 (CMGENERIC) #14: CMOST Version 7.2 beta1.1-P2: Thu Jan 21 14 :15:00 EST 1993 cmf -O -o s_do -implicit_none stream_d.fcm cmf [CM5 VecUnit 2.1 Beta 0] CALL CMF_describe_array(a) Descriptor address : 78b0c desc_or_obj_kind : array argument debug info: element_type : double float spare1 : 0 spare2 : 0 home : cm cm_location : 1343432712 initial_data : ffffffff user_rank : 1 axes_extents : 5000000 axes_layout_maps : 1 is_modified? 
: no array_geometry : 93c10 spare6 : ffffffff geometry_rank : 1 geometry_offsets : 0 axes_extents_ptr : 78e2c axes_maps_ptr : 78db8 geometry_offsets_ptr : 78db4 debug_info_ptr : 0 view_or_thread_ptr : ffffffff is_slicewise : 1 element_size : 8 Array geometry id: 0x93c10 Rank: 1 Number of elements: 5000000 Extents: [5000000] Machine geometry id: 0x93bb0, rank: 1, column major Machine geometry elements: 5000192 Overall subgrid size: 78128 Axis 0: Extent: 5000192 (64 physical x 78128 subgrid) Off-chip: 6 bits, mask = 0x3f Subgrid: length = 78128, axis-increment = 1 CM5 with partition of 16 processors ( 64 vector units ) -------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Vector length = 5000000 Timing calibration: Time = 2.826233333333334 hundredths of a second Increase the size of the arrays if this is < 30 and your clock precision is =< 1/100 second --------------------------------------------------------- Function : Rate (MB/s) RMS time Min time Max time Assignment: 4881.2055 0.0164 0.0164 0.0164 Scaling : 4894.5720 0.0164 0.0163 0.0164 Summing : 7331.8805 0.0164 0.0164 0.0164 SAXPYing : 7333.7272 0.0164 0.0164 0.0164 From rsw@hydra.maths.unsw.EDU.AU Thu Apr 29 21:30:42 1993 Received: from hydra.maths.unsw.EDU.AU by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI) for mccalpin id AA07522; Thu, 29 Apr 93 21:30:42 -0400 Received: by hydra.maths.unsw.EDU.AU (5.61/1.35) id AA01969; Fri, 30 Apr 93 11:35:53 +1000 Date: Fri, 30 Apr 93 11:35:53 +1000 From: rsw@hydra.maths.unsw.EDU.AU Message-Id: <9304300135.AA01969@hydra.maths.unsw.EDU.AU> To: mccalpin Subject: CM5 Stream_s results Status: RO STREAM: Measure memory transfer rates in MB/s for simple computational kernels in Fortran SunOS Release 4.1.2 (CMGENERIC) #14: CMOST Version 7.2 beta1.1-P2: Thu Jan 21 14 :15:00 EST 1993 cpio% cmf -O -o s_so -implicit_none stream_s.fcm cmf [CM5 VecUnit 2.1 
Beta 0] CALL CMF_describe_array(a) Descriptor address : 7aaac desc_or_obj_kind : array argument debug info: element_type : float spare1 : 0 spare2 : 0 home : cm cm_location : 1342807592 initial_data : ffffffff user_rank : 1 axes_extents : 5000000 axes_layout_maps : 1 is_modified? : no array_geometry : 95fd0 spare6 : ffffffff geometry_rank : 1 geometry_offsets : 0 axes_extents_ptr : 7adcc axes_maps_ptr : 7ad58 geometry_offsets_ptr : 7ad54 debug_info_ptr : 0 view_or_thread_ptr : ffffffff is_slicewise : 1 element_size : 4 Array geometry id: 0x95fd0 Rank: 1 Number of elements: 5000000 Extents: [5000000] Machine geometry id: 0x95f70, rank: 1, column major Machine geometry elements: 5000192 Overall subgrid size: 78128 Axis 0: Extent: 5000192 (64 physical x 78128 subgrid) Off-chip: 6 bits, mask = 0x3f Subgrid: length = 78128, axis-increment = 1 CM5 with partition of 16 processors ( 64 vector units ) -------------------------------------- Single precision appears to have 7 digits of accuracy Assuming 4 bytes per default REAL word -------------------------------------- Vector length = 5000000 Timing calibration: Time = 6.677066 hundredths of a second Increase the size of the arrays if this is < 30 and your clock precision is = < 1/100 second --------------------------------------------------------- Function : Rate (MB/s) RMS time Min time Max time Assignment: 1420.9683 0.0282 0.0281 0.0295 Scaling : 1422.6085 0.0282 0.0281 0.0295 Summing : 2133.0415 0.0281 0.0281 0.0282 SAXPYing : 2133.1265 0.0282 0.0281 0.0295 From rsw@hydra.maths.unsw.EDU.AU Mon May 3 07:43:06 1993 Received: from hydra.maths.unsw.EDU.AU by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI) for mccalpin id AA14801; Mon, 3 May 93 07:43:06 -0400 Received: by hydra.maths.unsw.EDU.AU (5.61/1.35) id AA15342; Mon, 3 May 93 21:48:33 +1000 Date: Mon, 3 May 93 21:48:33 +1000 From: rsw@hydra.maths.unsw.EDU.AU Message-Id: <9305031148.AA15342@hydra.maths.unsw.EDU.AU> To: mccalpin Subject: Re: CM5 Stream_d 
results Status: RO The man page states the CM5 timers have microsecond precision. I tried a few other vector lengths, and the results are remarkably consistent. The runs were made when I had the partition virtually to myself. I have had varying results timing operations that involved communications, but not these. Would like to see results from some other CM5 sites, but Australia only has 32 PN machines. Rob Compiled with NO optimization: cmf -implicit_none stream_d.fcm STREAM: Measure memory transfer rates in MB/s for simple computational kernels in Fortran CALL CMF_describe_array(a) desc_or_obj_kind : array argument element_type : double float home : cm user_rank : 1 axes_extents : 12800000 axes_layout_maps : 1 element_size : 8 Array geometry id: 0x93b28 Rank: 1 Number of elements: 12800000 Extents: [12800000] Machine geometry id: 0x93ac8, rank: 1, column major Machine geometry elements: 12800000 Overall subgrid size: 200000 Axis 0: Extent: 12800000 (64 physical x 200000 subgrid) Off-chip: 6 bits, mask = 0x3f Subgrid: length = 200000, axis-increment = 1 CM5 with partition of 16 processors ( 64 vector units ) -------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Vector length = 12800000 Timing calibration: Time = 7.229106060606060 hundredths of a second Increase the size of the arrays if this is < 30 and your clock precision is =< 1/100 second --------------------------------------------------------- Function : Rate (MB/s) RMS time Min time Max time Assignment: 4885.82465 0.04206 0.04192 0.04332 Scaling : 4894.24551 0.04185 0.04185 0.04187 Summing : 5129.65827 0.05996 0.05989 0.06128 SAXPYing : 5129.20148 0.05997 0.05989 0.06130 =================================================================== Compiled with Optimization: cmf -O -implicit_none stream_d.fcm STREAM: Measure memory transfer rates in MB/s for simple computational kernels in Fortran 
CALL CMF_describe_array(a) desc_or_obj_kind : array argument element_type : double float home : cm user_rank : 1 axes_extents : 6400000 axes_layout_maps : 1 element_size : 8 Array geometry id: 0x93c10 Rank: 1 Number of elements: 6400000 Extents: [6400000] Machine geometry id: 0x93bb0, rank: 1, column major Machine geometry elements: 6400000 Overall subgrid size: 100000 Axis 0: Extent: 6400000 (64 physical x 100000 subgrid) Off-chip: 6 bits, mask = 0x3f Subgrid: length = 100000, axis-increment = 1 CM5 with partition of 16 processors ( 64 vector units ) -------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Vector length = 6400000 Timing calibration: Time = 3.616887878787879 hundredths of a second Increase the size of the arrays if this is < 30 and your clock precision is =< 1/100 second --------------------------------------------------------- Function : Rate (MB/s) RMS time Min time Max time Assignment: 4882.29863 0.02098 0.02097 0.02101 Scaling : 4894.68150 0.02092 0.02092 0.02093 Summing : 7333.00445 0.02095 0.02095 0.02096 SAXPYing : 7335.86990 0.02094 0.02094 0.02099 From rsw@hydra.maths.unsw.EDU.AU Wed May 5 18:44:42 1993 Received: from hydra.maths.unsw.EDU.AU by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI) for mccalpin id AA23379; Wed, 5 May 93 18:44:42 -0400 Received: by hydra.maths.unsw.EDU.AU (5.61/1.35) id AA06281; Thu, 6 May 93 08:50:19 +1000 Date: Thu, 6 May 93 08:50:19 +1000 From: rsw@hydra.maths.unsw.EDU.AU Message-Id: <9305052250.AA06281@hydra.maths.unsw.EDU.AU> To: mccalpin Subject: Re: CM5 Stream_d results Status: RO John The optimized results for the stream benchmark on a CM5 do look funny. I changed the constant 3.0D0 in the SAXPY operation to DBLE(k), where k is the loop index, and this inhibited whatever optimization was going on, to produce the results below. 
I also noticed that the unoptimized code could handle much larger
problems (200,000 elements per VU), while the optimized code ran out of
memory well before then (an extra temporary).

I will get and forward an assembly listing of just the SAXPY loop, if
you are interested.

Have you got any other info on how stream runs on a CM5? For large
enough vectors I would expect it to scale well to larger machines.

Rob

=============================================================================
NO Optimization: cmf stream_d.fcm

SAXPY with constant from loop index

      s = DBLE(k)
      CALL CM_timer_clear(4)
      CALL CM_timer_start(4)
      c = a + s * b
      CALL CM_timer_stop(4)
      t = CM_timer_read_elapsed(4)
      times(4,k) = t

STREAM: Measure memory transfer rates in MB/s
for simple computational kernels in Fortran

CM5 with partition of 16 processors ( 64 vector units )

--------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------

Array length = 9600000
Elements per VU = 150000
Timing calibration: Time = 5.455848484848485 hundredths of a second
Increase the size of the arrays if this is < 30
and your clock precision is =< 1/100 second

---------------------------------------------------------
Function   :  Rate (MB/s)    RMS time    Min time    Max time
Assignment :  4887.33331      0.03143     0.03143     0.03145
Scaling    :  4893.41038      0.03153     0.03139     0.03276
Summing    :  5145.05312      0.04500     0.04478     0.04615
SAXPYing   :  5143.42075      0.04501     0.04480     0.04616

=======================================================================
Optimization ON: cmf -O stream_d.fcm

STREAM: Measure memory transfer rates in MB/s
for simple computational kernels in Fortran

CM5 with partition of 16 processors ( 64 vector units )

--------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------

Array length = 9600000
Elements per VU = 150000
Timing
calibration: Time = 5.468930303030303 hundredths of a second Increase the size of the arrays if this is < 30 and your clock precision is =< 1/100 second --------------------------------------------------------- Function : Rate (MB/s) RMS time Min time Max time Assignment: 4894.00569 0.03139 0.03139 0.03142 Scaling : 4898.13904 0.03144 0.03136 0.03273 Summing : 7328.59745 0.03158 0.03144 0.03280 SAXPYing : 5143.05891 0.04494 0.04480 0.04615 From derek_robb@corwin.cray.com Thu May 6 14:53:49 1993 Received: from bach.udel.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI) for mccalpin id AA26030; Thu, 6 May 93 14:53:49 -0400 Received: from corwin.cray.com by bach.udel.edu with SMTP (5.65c/IDA-1.2.8) id AA12931; Thu, 6 May 1993 14:57:58 -0400 Message-Id: <199305061857.AA12931@bach.udel.edu> Date: 6 May 93 12:56:23 U From: "Derek Robb" Subject: STREAM Benchmark To: mccalpin@bach.udel.edu Status: RO Subject: Time:12:57 PM OFFICE MEMO STREAM Benchmark Date:5/6/93 I received a fax copy of your STREAM benchmark results for about 100 systems from Robert Bell at CSIRO. I would like to have these in digital form. Would you be kind enough to e-mail these to me. Thanks, Derek Robb From morse@mprgate.mpr.ca Fri May 7 11:31:13 1993 Received: from [134.87.131.13] by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI) for mccalpin id AA29121; Fri, 7 May 93 11:31:13 -0400 Received: from quark.mpr.ca by mprgate.mpr.ca with SMTP id AA12379 (5.65c/IDA-1.4.4 for ); Fri, 7 May 1993 08:35:18 -0700 Received: by quark.mpr.ca (5.57/Ultrix3.0-C) id AA17608; Fri, 7 May 93 08:35:15 -0700 Date: Fri, 7 May 93 08:35:15 -0700 From: morse@mprgate.mpr.ca (Daryl Morse) Message-Id: <9305071535.AA17608@quark.mpr.ca> To: mccalpin (John D. McCalpin) In-Reply-To: mccalpin@perelandra.cms.udel.edu's message of Thu, 6 May 1993 16:53:08 GM Subject: qgbox on HP9000/755 ? Status: RO Newsgroups: comp.benchmarks From: mccalpin@perelandra.cms.udel.edu (John D. 
McCalpin)
Nntp-Posting-Host: perelandra.cms.udel.edu
Organization: College of Marine Studies, U. Del.
Distribution: usa
Date: Thu, 6 May 1993 16:53:08 GMT

> I am trying to update the results table for a benchmark code of
> mine (qgbox) and do not have access to one of the newer HP9000/7xx
> machines, like the 735/755 models.

> It hardly seems fair for me to be quoting IBM RS/6000-580 and
> DEC 3000/500 results against the older HP 9000/730's.

Actually, if you really want to make it fair, you should try to get
numbers for DEC's 3000/500X, which is significantly faster than the
3000/500.

If not the undiscounted price of the machine the tests were run on,
providing an indication of which machines are expensive high-end
server class machines, which are desktop, etc., would make your
numbers all the more interesting. Comparing servers against servers
and workstations against workstations is the most fair way to go.

I point this out, because my suspicions tell me that the IBM RS6000 up
near the top of the heap is a very pricey server. (If I am wrong about
that, please disregard my suspicion.)

Please don't take my comments personally. The numbers are definitely
of interest.

Thanks.

Daryl Morse                   | Voice  : (604) 293-5476
MPR Teltech Ltd.              | Fax    : (604) 293-5787
8999 Nelson Way, Burnaby, BC  | E-Mail : morse@mpr.ca
Canada, V5A 4B5               |        : mprgate.mpr.ca!morse@uunet.uu.net

From desj@ccr-p.ida.org Sun May 9 16:20:30 1993
Received: from idacrd.ccr-p.ida.org by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
        for mccalpin id AA05590; Sun, 9 May 93 16:20:30 -0400
Received: from runner.ida.org (runner.ccr-p.ida.org) by ccr-p.ida.org (4.1/SMI-4.1)
        id AA26226; Sun, 9 May 93 16:24:50 EDT
Date: Sun, 9 May 93 16:24:50 EDT
From: desj@ccr-p.ida.org (David desJardins)
Message-Id: <9305092024.AA26226@ccr-p.ida.org>
Received: by runner.ida.org (4.1/SMI-4.1)
        id AA25807; Sun, 9 May 93 16:24:49 EDT
To: mccalpin
Subject: Re: SPECint92, and other benchmarks (really LINPACK)
Newsgroups: comp.sys.sgi.hardware,comp.benchmarks
In-Reply-To:
References: <1993Apr13.225551.26609@texhrc.uucp> <1rlmriINNq9q@usenet.pa.dec.com>
Organization: IDA Center for Communications Research, Princeton
Cc:
Status: RO

In article you write:
>The current winner for aggregate bandwidth is the Cray C90, which can
>sustain 105 GB/s from its shared main memory to/from its 16 cpus.
>The Thinking Machines CM-5 is a potential competitor here, but I have
>not been able to get results off of one yet....

A 1024 node CM-5 could sustain 90% or more of its peak bandwidth,
which is 1024 * 4 * 16M * 8 = 512 GB/s. (It can run at over 99% of
that, if you are doing something especially trivial like adding up the
entries in a very large array.)

Of course this is just between the vector units and their local
memories.
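
The peak-bandwidth arithmetic above can be checked with a short Python sketch. It is only an illustration under stated assumptions: the four factors are read as nodes, vector units per node, 16 million words per second per vector unit, and 8 bytes per word, and "GB" is taken to mean 1024 MB, since that is the reading that reproduces the quoted 512. The function name is made up for this example. The 16-node comparison uses the optimized SAXPYing rate (7333.7272 MB/s) from the CM5 Stream_d results earlier in this archive.

```python
def cm5_peak_mb_per_s(nodes, vus_per_node=4, mwords_per_s=16, bytes_per_word=8):
    """Peak vector-unit-to-local-memory bandwidth in MB/s (1 MB = 10**6 bytes).

    The default factors are assumptions inferred from the expression
    1024 * 4 * 16M * 8 quoted in the message, not documented hardware values.
    """
    return nodes * vus_per_node * mwords_per_s * bytes_per_word


# 1024-node machine: the quoted 512 GB/s follows if 1 GB is counted as 1024 MB.
peak_1024 = cm5_peak_mb_per_s(1024)
peak_1024_gb = peak_1024 / 1024
print(peak_1024, "MB/s =", peak_1024_gb, "GB/s")

# 16-node partition (64 vector units), as used for the CM5 runs in this archive.
peak_16 = cm5_peak_mb_per_s(16)
measured_saxpy = 7333.7272  # MB/s, optimized stream_d SAXPYing result
fraction_of_peak = measured_saxpy / peak_16
print(f"16 nodes: peak {peak_16} MB/s, measured SAXPY is {fraction_of_peak:.1%} of peak")
```

Under these assumptions the measured 16-node SAXPYing rate comes out close to the "90% or more" figure, while the Assignment and Scaling rates (about 4880 MB/s) sit near 60% of the same peak.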

David desJardins

From tuna@lhotse.LCS.MIT.EDU Tue May 11 18:55:49 1993
Received: from LHOTSE.LCS.MIT.EDU by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
        for mccalpin id AA12494; Tue, 11 May 93 18:55:49 -0400
Received: by lhotse.LCS.MIT.EDU
        id AA28381; Tue, 11 May 93 19:00:17 -0400
Date: Tue, 11 May 93 19:00:17 -0400
From: tuna@lhotse.LCS.MIT.EDU (Kirk 'UhOh' Johnson)
Message-Id: <9305112300.AA28381@lhotse.LCS.MIT.EDU>
To: mccalpin
Subject: stream results for SS10/30
Reply-To: Kirk Johnson
Status: RO

i noticed your stream results don't include any measurements for
non-supercache SS10 systems, so i went ahead and compiled things up on
one of our SS10/30s and ran some measurements. the specific details
are:

  - SS10/30 (36 MHz, 20/16 kbyte on-chip I/D caches)
  - SunOS 4.1.3
  - compiled with "f77 -O4 -Bstatic" (using whatever version of f77
    we got from sun at the same time they shipped us sun C 1.0; it's
    almost certainly not the latest and greatest FORTRAN compiler from
    sun)

results from five consecutive runs of "stream_s":

--------------------------------------
Single precision appears to have 7 digits of accuracy
Assuming 4 bytes per default REAL word
--------------------------------------
Timing calibration ; time = 255.000 hundredths of a second
Increase the size of the arrays if this is <30
and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)   RMS time   Min time   Max time
Assignment:    33.8984      0.5910     0.5900     0.6000
Scaling   :    41.6667      0.4820     0.4800     0.4900
Summing   :    38.9610      0.7790     0.7700     0.7800
SAXPYing  :    33.7079      0.8910     0.8900     0.9000

--------------------------------------
Single precision appears to have 7 digits of accuracy
Assuming 4 bytes per default REAL word
--------------------------------------
Timing calibration ; time = 252.000 hundredths of a second
Increase the size of the arrays if this is <30
and your clock precision is =<1/100 second
---------------------------------------------------
Function Rate (MB/s) RMS time Min time Max time Assignment: 33.8983 0.5940 0.5900 0.6000 Scaling : 41.6667 0.4840 0.4800 0.4900 Summing : 38.9610 0.7780 0.7700 0.7800 SAXPYing : 34.0909 0.8880 0.8800 0.8900 -------------------------------------- Single precision appears to have 7 digits of accuracy Assuming 4 bytes per default REAL word -------------------------------------- Timing calibration ; time = 251.000 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 33.8983 0.5930 0.5900 0.6100 Scaling : 41.6667 0.4820 0.4800 0.4900 Summing : 38.9610 0.7780 0.7700 0.7800 SAXPYing : 33.7079 0.8900 0.8900 0.8900 -------------------------------------- Single precision appears to have 7 digits of accuracy Assuming 4 bytes per default REAL word -------------------------------------- Timing calibration ; time = 251.000 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 33.8983 0.5930 0.5900 0.6100 Scaling : 41.6667 0.4850 0.4800 0.4900 Summing : 38.9610 0.7810 0.7700 0.7900 SAXPYing : 34.0909 0.8870 0.8800 0.8900 -------------------------------------- Single precision appears to have 7 digits of accuracy Assuming 4 bytes per default REAL word -------------------------------------- Timing calibration ; time = 252.000 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 33.8983 0.5930 0.5900 0.6100 Scaling : 41.6667 0.4820 0.4800 0.4900 Summing : 38.9611 0.7810 0.7700 0.7900 SAXPYing : 34.0909 0.8900 0.8800 0.9000 results from five consecutive runs of "stream_d": 
-------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Timing calibration ; time = 303.99999339134 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 42.1053 0.5790 0.5700 0.5900 Scaling : 46.1538 0.5280 0.5200 0.5400 Summing : 45.5697 0.7910 0.7900 0.8000 SAXPYing : 46.1539 0.7880 0.7800 0.7900 -------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Timing calibration ; time = 302.99999434501 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 42.1053 0.5790 0.5700 0.5800 Scaling : 46.1539 0.5310 0.5200 0.5500 Summing : 46.1539 0.7870 0.7800 0.7900 SAXPYing : 46.1539 0.7880 0.7800 0.7900 -------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Timing calibration ; time = 301.99999436736 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 42.1053 0.5800 0.5700 0.5900 Scaling : 46.1539 0.5281 0.5200 0.5500 Summing : 46.1538 0.7890 0.7800 0.8000 SAXPYing : 46.1539 0.7870 0.7800 0.7900 -------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Timing calibration ; time = 301.99999623001 
hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 42.1053 0.5790 0.5700 0.5800 Scaling : 46.1540 0.5300 0.5200 0.5400 Summing : 46.1539 0.7870 0.7800 0.7900 SAXPYing : 46.1538 0.7870 0.7800 0.7900 -------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Timing calibration ; time = 302.99999527633 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 42.1053 0.5780 0.5700 0.5800 Scaling : 46.1540 0.5291 0.5200 0.5500 Summing : 46.1539 0.7880 0.7800 0.7900 SAXPYing : 46.1539 0.7880 0.7800 0.7900 share and enjoy, kirk From tuna@kanchenjunga.LCS.MIT.EDU Wed May 12 13:17:46 1993 Received: from KANCHENJUNGA.LCS.MIT.EDU by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI) for mccalpin id AA15631; Wed, 12 May 93 13:17:46 -0400 Received: by kanchenjunga.LCS.MIT.EDU id AA06002; Wed, 12 May 93 13:22:16 -0400 Date: Wed, 12 May 93 13:22:16 -0400 From: tuna@kanchenjunga.LCS.MIT.EDU (Kirk 'UhOh' Johnson) Message-Id: <9305121722.AA06002@kanchenjunga.LCS.MIT.EDU> To: mccalpin Subject: stream results for SS10/30 Reply-To: Kirk Johnson Status: RO thanks for the results. I will put them in the table today.... sure. we've also got two of the 50 MHz + 1 MB external cache processor upgrade modules on order; when they show up here (late june?), i'll run the same measurements on them as well ... 

kirk

From cmg@ferrari.cray.com Thu May 27 18:59:37 1993
Received: from timbuk.cray.com by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI)
        for mccalpin id AA26093; Thu, 27 May 93 18:59:37 -0400
Received: from magnet (magnet.cray.com) by cray.com (4.1/CRI-MX 2.19)
        id AA18038; Thu, 27 May 93 18:00:51 CDT
Received: by magnet (4.1/CRI-5.13)
        id AA04064; Thu, 27 May 93 18:00:46 CDT
From: cmg@ferrari.cray.com (Charles Grassl)
Message-Id: <9305272300.AA04064@magnet>
Subject: stream
To: mccalpin
Date: Thu, 27 May 93 18:00:42 CDT
X-Mailer: ELM [version 2.3 PL11]
Status: RO

Hello John;

Below are performance results for running the STREAM benchmark on an
8 CPU CRAY Y-MP EL98. If you are maintaining and distributing a list
with STREAM results, could you please add the EL results to your list.

The CRAY Y-MP EL98 has up to 8 CPUs which run at 33.3 MHz. Each CPU
has four floating point functional units. Each functional unit can
produce one floating point add or (exclusive) one floating point
multiply per clock period.

Pairs of EL CPUs share four memory ports. As you can see from the
data, the memory ports can sustain up to four CPUs. For eight CPUs,
the Assignment, Scaling and Summing loops run much faster than for
4 CPUs because the additional CPUs use ports C and D. For SAXPYing,
the additional CPUs only have use of port D. (Actually, the pairs of
CPUs share all four ports.) This benchmark is an interesting
experiment for this memory architecture!

John, I'm trying to figure out a way to characterize computer power by
a "weighted" memory bandwidth. More memory "closer", or faster, would
count more than memory "farther", or slower. My idea is to calculate
the second moment of memory defined as:

                           __
    Memory Moment (MM)  =  \   memory words * (rate)^2
                           /
                           --

The MM would have units of [ Mbyte * (Mbyte/s)^2 ]

For the rate, I could use the transfer rate measured by STREAM.
The measurement of MM would entail running STREAM for all sizes of
memory and "integrating" or summing the contributions.

Does this look reasonable to you? Would MM have any predictive power
for computer performance?

One other question: What would be analogous to torque for the above
"moment of inertia"?

Regards,
--
Charles Grassl
Cray Research, Inc.
cmg@cray.com
(612) 683-3531

====================================================
1 CPUs
====================================================
--------------------------------------
Single precision appears to have 14 digits of accuracy
Assuming 8 bytes per default REAL word
--------------------------------------
Timing calibration ; time = 31.968054 hundredths of a second
Increase the size of the arrays if this is <30
and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)   RMS time   Min time   Max time
Assignment:   437.2382      0.1098     0.1098     0.1098
Scaling   :   436.6548      0.1106     0.1099     0.1113
Summing   :   536.1533      0.1346     0.1343     0.1349
SAXPYing  :   476.7855      0.1510     0.1510     0.1510
STOP (called by STREAM )
CP: 1.345s, Wallclock: 1.349s, 12.5% of 8-CPU Machine
HWM mem: 9129405, HWM stack: 9004265, Stack overflows: 0

====================================================
2 CPUs
====================================================
--------------------------------------
Single precision appears to have 14 digits of accuracy
Assuming 8 bytes per default REAL word
--------------------------------------
Timing calibration ; time = 15.945318 hundredths of a second
Increase the size of the arrays if this is <30
and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)   RMS time   Min time   Max time
Assignment:   826.6837      0.0581     0.0581     0.0582
Scaling   :   833.7715      0.0576     0.0576     0.0576
Summing   :  1048.9873      0.0686     0.0686     0.0687
SAXPYing  :  1078.1763      0.0668     0.0668     0.0669
STOP (called by STREAM )
CP: 1.354s, Wallclock: 1.076s, 15.7% of 8-CPU Machine
HWM mem: 9129405, HWM stack: 9004265, Stack overflows: 0 ==================================================== 4 CPUs ==================================================== -------------------------------------- Single precision appears to have 14 digits of accuracy Assuming 8 bytes per default REAL word -------------------------------------- Timing calibration ; time = 8.152287 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 1564.9253 0.0307 0.0307 0.0307 Scaling : 1569.8418 0.0306 0.0306 0.0307 Summing : 1933.8354 0.0373 0.0372 0.0374 SAXPYing : 1955.4611 0.0369 0.0368 0.0369 STOP (called by STREAM ) CP: 1.440s, Wallclock: 0.772s, 23.3% of 8-CPU Machine HWM mem: 9139645, HWM stack: 9004265, Stack overflows: 0 ==================================================== 8 CPUs ==================================================== -------------------------------------- Single precision appears to have 14 digits of accuracy Assuming 8 bytes per default REAL word -------------------------------------- Timing calibration ; time = 5.824599 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 2362.8129 0.0244 0.0203 0.0279 Scaling : 2310.5194 0.0249 0.0208 0.0284 Summing : 2373.6665 0.0312 0.0303 0.0321 SAXPYing : 2363.7914 0.0305 0.0305 0.0305 STOP (called by STREAM ) CP: 2.011s, Wallclock: 1.160s, 21.7% of 8-CPU Machine HWM mem: 9149885, HWM stack: 9004265, Stack overflows: 0 From csrcb@mel.dit.csiro.au Tue Jun 22 22:04:39 1993 Received: from bach.udel.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI) for mccalpin id AA16894; Tue, 22 Jun 93 22:04:39 -0400 Received: from shark.mel.dit.CSIRO.AU by bach.udel.edu with SMTP 
(5.65c/IDA-1.2.8) id AA19731; Tue, 22 Jun 1993 22:07:13 -0400 Received: by shark.mel.dit.csiro.au id AA17947 (5.65c/IDA-1.4.4/DIT-1.3 for mccalpin@bach.udel.edu); Wed, 23 Jun 1993 12:07:27 +1000 From: Robert Bell Message-Id: <199306230207.AA17947@shark.mel.dit.csiro.au> Subject: Stream Benchmark To: mccalpin@bach.udel.edu Date: Wed, 23 Jun 93 12:07:26 EST X-Mailer: ELM [version 2.3 PL11] Status: RO John, I have been following your stream benchmark with some interest. A few years ago, I developed an interest in measuring the performance of computer memory systems, and have some codes which illustrate and measure characteristics and performance. I worked with Charles Grassl from Cray on these codes some time ago. Anyway, I have just downloaded a copy of the latest summary of the results, and have a query about the Y-MP EL 1 cpu results for the Triad benchmark. Is the stated figure of 476.8 correct? If so, there is a curiosity in that the 2 cpu result is more than twice as fast, which is hard to explain. Is there a misprint, with the true figure being 576.8? This would be more consistent with the other Y-MP EL and Y-MP results. Thanks Rob.
Bell ( email: csrcb@mel.dit.csiro.au ) -- / Robert.Bell@mel.dit.csiro.au | CSIRO Supercomputing Facility Manager \ | CSIRO Division of Information Technology, Supercomputing Support Group | | 723 Swanston Street | tel: +61 3 282 2620 or +61 018 108 333 | \ Carlton VIC 3053 Australia | fax: +61 3 282 2600 / From tuna@spica.LCS.MIT.EDU Wed Jun 30 10:33:08 1993 Received: from SPICA.LCS.MIT.EDU by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI) for mccalpin id AA01755; Wed, 30 Jun 93 10:33:08 -0400 Received: by spica.LCS.MIT.EDU id AA00480; Wed, 30 Jun 93 10:35:47 -0400 Date: Wed, 30 Jun 93 10:35:47 -0400 From: tuna@spica.LCS.MIT.EDU (Kirk 'UhOh' Johnson) Message-Id: <9306301435.AA00480@spica.LCS.MIT.EDU> To: mccalpin Subject: stream results for SS10/51 Reply-To: Kirk Johnson Status: RO here are results for your stream benchmarks running on an SS10/51 (50 MHz SuperSparc, 1 MB external cache), compiled with "f77 -O" using whatever version of fortran sun shipped along with their sun C 1.0 product. 
-------------------------------------- Single precision appears to have 7 digits of accuracy Assuming 4 bytes per default REAL word -------------------------------------- Timing calibration ; time = 55.0000 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 44.4444 0.1011 0.0900 0.1100 Scaling : 40.0000 0.1031 0.1000 0.1100 Summing : 42.8572 0.1471 0.1400 0.1500 SAXPYing : 42.8572 0.1461 0.1400 0.1500 -------------------------------------- Single precision appears to have 7 digits of accuracy Assuming 4 bytes per default REAL word -------------------------------------- Timing calibration ; time = 55.0000 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 40.0000 0.1041 0.1000 0.1100 Scaling : 40.0000 0.1041 0.1000 0.1100 Summing : 42.8571 0.1481 0.1400 0.1500 SAXPYing : 42.8572 0.1431 0.1400 0.1500 -------------------------------------- Single precision appears to have 7 digits of accuracy Assuming 4 bytes per default REAL word -------------------------------------- Timing calibration ; time = 56.0000 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 44.4446 0.1032 0.0900 0.1100 Scaling : 40.0000 0.1031 0.1000 0.1100 Summing : 42.8572 0.1471 0.1400 0.1500 SAXPYing : 42.8572 0.1461 0.1400 0.1500 -------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Timing calibration ; time = 60.999995470047 hundredths of a second Increase the size of 
the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 43.6365 0.1221 0.1100 0.1300 Scaling : 43.6363 0.1171 0.1100 0.1200 Summing : 42.3530 0.1780 0.1700 0.1800 SAXPYing : 42.3530 0.1720 0.1700 0.1800 -------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Timing calibration ; time = 59.999997913837 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 40.0000 0.1241 0.1200 0.1300 Scaling : 43.6363 0.1190 0.1100 0.1200 Summing : 42.3530 0.1771 0.1700 0.1800 SAXPYing : 42.3529 0.1780 0.1700 0.1800 -------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Timing calibration ; time = 61.000002920628 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 40.0000 0.1251 0.1200 0.1300 Scaling : 43.6365 0.1161 0.1100 0.1200 Summing : 42.3530 0.1771 0.1700 0.1800 SAXPYing : 42.3529 0.1731 0.1700 0.1800 share-n-enjoy, kirk From news.udel.edu!darwin.sura.net!howland.reston.ans.net!ux1.cso.uiuc.edu!sdd.hp.com!decwrl!pa.dec.com!uvo.dec.com!helles.unt.dec.com!ryn.mro4.dec.com!msbcs.enet.dec.com!bhandarkar Thu Jul 15 16:59:45 EDT 1993 Article: 41327 of comp.arch Newsgroups: comp.arch Path: news.udel.edu!darwin.sura.net!howland.reston.ans.net!ux1.cso.uiuc.edu!sdd.hp.com!decwrl!pa.dec.com!uvo.dec.com!helles.unt.dec.com!ryn.mro4.dec.com!msbcs.enet.dec.com!bhandarkar From: 
bhandarkar@msbcs.enet.dec.com (Dileep Bhandarkar) Subject: Re: Looking for info. on DEC 3000 AXP 400 Message-ID: Sender: news@ryn.mro4.dec.com (USENET News System) Organization: Digital Equipment Corporation References: <1993Jul13.174722.10241@eecs.nwu.edu> Date: Thu, 15 Jul 1993 20:39:23 GMT Lines: 13 Status: RO In article <1993Jul13.174722.10241@eecs.nwu.edu>, shil@nasser.eecs.nwu.edu (Lei Shi) writes... >Hello: > > I am looking for the following information about the DEC 3000 AXP model 400. > My questions are: >1. What are the read miss penalty and write miss penalty in terms of the CPU >cycles for the first level cache if the data is in the second level cache? >2. What are the read miss penalty and write miss penalty in terms of the CPU >cycles for the second level cache if the data is in the main memory? > The second level cache access time is 5 cycles. Memory access time is 27 cycles. Clock rate is 133 MHz. From news.udel.edu!udel!wupost!usc!elroy.jpl.nasa.gov!decwrl!koriel!sh.wide!wnoc-tyo-news!cs.titech!is.titech!maeno Fri Jul 16 09:44:00 EDT 1993 Article: 41343 of comp.arch Newsgroups: comp.arch Path: news.udel.edu!udel!wupost!usc!elroy.jpl.nasa.gov!decwrl!koriel!sh.wide!wnoc-tyo-news!cs.titech!is.titech!maeno From: maeno@is.titech.ac.jp (Toshinori Maeno) Subject: Re: Looking for info. on DEC 3000 AXP 400 References: <1993Jul13.174722.10241@eecs.nwu.edu> Message-ID: <1993Jul16.104143.27476@is.titech.ac.jp> Date: Fri, 16 Jul 1993 10:41:43 GMT Organization: Dept. of Information Science, Tokyo Institute of Technology, Tokyo, JAPAN X-Bytes: 1060 Lines: 27 Status: RO In article bhandarkar@msbcs.enet.dec.com (Dileep Bhandarkar) writes: > >In article <1993Jul13.174722.10241@eecs.nwu.edu>, shil@nasser.eecs.nwu.edu (Lei Shi) writes... >> I am looking for the following information about the DEC 3000 AXP model 400. >> My questions are: >>1. 
What are the read miss penalty and write miss penalty in terms of the CPU >>cycles for the first level cache if the data is in the second level cache? >>2. What are the read miss penalty and write miss penalty in terms of the CPU >>cycles for the second level cache if the data is in the main memory? >> >The second level cache access time is 5 cycles. Memory access time is 27 cycles. >Clock rate is 133 MHz. My measurement for TITAN2-400 (Alpha 133MHz) tells, 1. read miss penalty is 8 cycles for read, 12 cycles for write for the first level cache when the data is in the second level cache. 2. read miss penalty is 12 cycles for read, 40 cycles for write when the data is only in the memory. Toshinori Maeno Tokyo Institute of Technology From news.udel.edu!udel!wupost!usc!elroy.jpl.nasa.gov!decwrl!koriel!sh.wide!wnoc-tyo-news!cs.titech!is.titech!maeno Fri Jul 16 09:44:12 EDT 1993 Article: 41344 of comp.arch Newsgroups: comp.arch Path: news.udel.edu!udel!wupost!usc!elroy.jpl.nasa.gov!decwrl!koriel!sh.wide!wnoc-tyo-news!cs.titech!is.titech!maeno From: maeno@is.titech.ac.jp (Toshinori Maeno) Subject: Re: Looking for info. on DEC 3000 AXP 400 References: <1993Jul13.174722.10241@eecs.nwu.edu> <1993Jul16.104143.27476@is.titech.ac.jp> Message-ID: <1993Jul16.104701.27493@is.titech.ac.jp> Date: Fri, 16 Jul 1993 10:47:01 GMT Organization: Dept. of Information Science, Tokyo Institute of Technology, Tokyo, JAPAN X-Bytes: 597 Lines: 17 Status: RO Sorry for my mistake in my last posting. In article <1993Jul16.104143.27476@is.titech.ac.jp> maeno@is.titech.ac.jp (Toshinori Maeno) writes: >My measurement for TITAN2-400 (Alpha 133MHz) tells, > 1. read miss penalty is 8 cycles for read, 12 cycles for write for the == 34 is correct >first level cache when the data is in the second level cache. > > 2. read miss penalty is 12 cycles for read, 40 cycles for write when >the data is only in the memory. 
Toshinori Maeno Tokyo Institute of Technology From news.udel.edu!darwin.sura.net!math.ohio-state.edu!cs.utexas.edu!swrinde!elroy.jpl.nasa.gov!decwrl!deccrl!news.crl.dec.com!stewart Fri Jul 16 22:15:58 EDT 1993 Article: 41359 of comp.arch Newsgroups: comp.arch Path: news.udel.edu!darwin.sura.net!math.ohio-state.edu!cs.utexas.edu!swrinde!elroy.jpl.nasa.gov!decwrl!deccrl!news.crl.dec.com!stewart From: stewart@crl.dec.com (Larry Stewart) Subject: Re: Looking for info. on DEC 3000 AXP 400 Message-ID: <1993Jul16.211105.6132@crl.dec.com> Sender: news@crl.dec.com (USENET News System) Reply-To: stewart@crl.dec.com Organization: DEC Cambridge Research Lab References: <1993Jul13.174722.10241@eecs.nwu.edu> <1993Jul16.104143.27476@is.titech.ac.jp> Date: Fri, 16 Jul 1993 21:11:05 GMT Lines: 98 Status: RO In article , tremblay@flayout.Eng.Sun.COM (Marc Tremblay) writes: > In article <1993Jul16.104143.27476@is.titech.ac.jp> maeno@is.titech.ac.jp (Toshinori Maeno) writes: > > 2. read miss penalty is 12 cycles for read, 40 cycles for write when > >the data is only in the memory. > > Maybe someone from DEC can explain the discrepancy between the 12 cycles > claimed here and the 27 cycles that was claimed in a previous message > for a second level read miss. > > - Marc Tremblay. > Sun Microsystems. That's easy. The 12 cycle number is wrong. Dileep's message was accurate, but it is pretty easy to draw wrong conclusions from the numbers, and quite difficult to actually measure them. What the hardware does (read): 1 cycle 1st level cache access 5 cycle second level cache access 27 cycle main memory access What the programmer sees: The program keeps running after the LD is issued, and stalls only if the destination register of the LD is touched before the data gets there. Consequently, in order to measure the load latency, you have to do something like the following: 1) Perform a series of references to assure that the test reference will be a hit or miss in the appropriate cache. 
2) Do an MB to make sure the write buffers are flushed.
3) Let the pin bus become idle, to ensure your test reference is not stalled behind some other activity.
4) Execute a test code sequence, consisting (typically) of:
     RCC ; read cycle counter
     LD  ; test load instruction
     ADD ; some instruction to touch the result register
     RCC ; to read the cycle counter again.
   Of course it isn't that simple, since you have to know the alignment of the instructions, and which ones will dual-issue with which.
5) Correct the result of the measurement for the effects of the RCC instructions.

In fact, the 21064 can have two outstanding loads. The third load will stall, and there are some other weird stall conditions; read the 21064 data book. The answers given by this measurement are (I think) 3-cycle latency for the first level cache (load-use penalty) and 8 for the external cache, because it takes a couple of cycles to get the pipes moving and to get the address to the pins, before you can start the cache access. The rep-rates are 1 cycle for the internal cache and 5 for the external.

Measuring store latency is even harder. First you have to decide what it means, since the 21064 will store up to 128 bytes of write data in the write buffers before incurring ANY delays to the program. One thing it might mean is how long does it take before the cells in the DRAMS get new values. Who cares? Another thing it might mean is how long does it take to write a value and then read it back. (Pushing arguments and calling a procedure, which pops them might have performance limited by such a path.) On the 3000/400 this path requires that the write data reach the external cache and then can be read back in via a second level cache hit. The performance of this path is on the order of 20 cycles, and can be measured using the cycle counter as above. Of course a good compiler might pass arguments in registers, since reading back something you've just written can be expensive on many modern machines.
Writes to the secondary cache take longer than reads because the chip must make two cache accesses to do a write. The first access is a read to check the tag store and dirty bits. The second access writes the data and updates the dirty bit. If the relevant write buffer entry contains data from both halves of the 32 byte cache line, then the write will take three 5-cycle cache accesses, because the pin bus is only 16 bytes wide. So my view is that asking "what are the read and write latencies" is interesting, but can be simplistic, since you cannot use the answers to predict anything in particular. You cannot add read-latencies to calculate bandwidth, because the memory is delivering 32 bytes per access, not just what you asked for. In any case, the memory rep-rate is NOT the same as the latency. The situation for writes is more clear, write-latency is very nearly uninteresting, since it has almost nothing to do with performance. Write-bandwidth is interesting, and the fact that write activity limits the bandwidth available for reads is interesting, but who cares how long it takes? (other than a programmed I/O device driver.) -Larry Stewart -- Digital Equipment Corporation Cambridge Research Laboratory From news.udel.edu!udel!gatech!swrinde!elroy.jpl.nasa.gov!ames!koriel!sh.wide!wnoc-tyo-news!cs.titech!is.titech!maeno Sat Jul 17 07:46:01 EDT 1993 Article: 41362 of comp.arch Newsgroups: comp.arch Path: news.udel.edu!udel!gatech!swrinde!elroy.jpl.nasa.gov!ames!koriel!sh.wide!wnoc-tyo-news!cs.titech!is.titech!maeno From: maeno@is.titech.ac.jp (Toshinori Maeno) Subject: Re: Looking for info. on DEC 3000 AXP 400 References: <1993Jul16.104143.27476@is.titech.ac.jp> Message-ID: <1993Jul17.024952.3137@is.titech.ac.jp> Date: Sat, 17 Jul 1993 02:49:52 GMT Organization: Dept. 
of Information Science, Tokyo Institute of Technology, Tokyo, JAPAN X-Bytes: 1186 Lines: 27 Status: RO In article tremblay@flayout.Eng.Sun.COM (Marc Tremblay) writes: >In article <1993Jul16.104143.27476@is.titech.ac.jp> maeno@is.titech.ac.jp (Toshinori Maeno) writes: >> 2. read miss penalty is 12 cycles for read, 40 cycles for write when >>the data is only in the memory. > >At 133 MHz, 12 cycles represent 90ns. Given the size of main memory >(the SPEC92 numbers were obtained on a machine with 128 MB), 90ns for >the miss processing and for main memory latency would put tough >constraints on the DRAMs. I suspect that 12 cycles are what is measured >from address out to data in and does not include the overhead for >the miss handling and bringing the data into the pipeline. > >Maybe someone from DEC can explain the discrepancy between the 12 cycles >claimed here and the 27 cycles that was claimed in a previous message >for a second level read miss. >Sun Microsystems. I am very sorry, I was confused and made a second mistake. >> 2. read miss penalty is 12 cycles for read, 40 cycles for write when == 33 was the measured cycles. >>the data is only in the memory. 
Toshinori Maeno From mccalpin@strauss.udel.edu Sat Sep 11 06:31:29 1993 Received: from strauss.udel.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI) for mccalpin id AA04603; Sat, 11 Sep 93 06:31:27 -0400 Return-Path: Received: from localhost (mccalpin@localhost) by strauss.udel.edu (8.5/8.5) id GAA05081; Sat, 11 Sep 1993 06:33:53 -0400 Date: Sat, 11 Sep 1993 06:33:53 -0400 From: John D McCalpin Message-Id: <199309111033.GAA05081@strauss.udel.edu> To: mccalpin Subject: Stream on Sun/2000 Content-Length: 1631 Status: RO -------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Timing calibration ; time = 592.05609671772 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 34.4371 0.9395 0.9292 0.9596 Scaling : 36.6877 0.8909 0.8722 0.9106 Summing : 35.2337 1.3788 1.3623 1.3948 SAXPYing : 35.0374 1.3771 1.3700 1.3870 Note: this program was linked with -fast or -fnonstd and so may have produced nonstandard floating-point results. Sun's implementation of IEEE arithmetic is discussed in the Numerical Computation Guide. 
real 52.57 user 48.14 sys 3.68 -------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Timing calibration ; time = 132.00789839029 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 36.6677 0.2223 0.2182 0.2462 Scaling : 34.6741 0.2327 0.2307 0.2355 Summing : 37.3276 0.3243 0.3215 0.3299 SAXPYing : 35.6657 0.3395 0.3365 0.3452 From mccalpin@cacr Ukn Sep 20 10:02:11 1993 Received: from cacr.coastal.udel.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI) for mccalpin id AA02591; Mon, 20 Sep 93 10:02:10 -0400 Return-Path: Received: by cacr (920330.SGI/920502.SGI.AUTO) for mccalpin@perelandra.cms.udel.edu id AA10035; Mon, 20 Sep 93 10:10:28 -0400 Date: Mon, 20 Sep 93 10:10:28 -0400 From: mccalpin@cacr (John D. 
McCalpin) Message-Id: <9309201410.AA10035@cacr> To: mccalpin Subject: stream_d on Indigo R4000 Status: RO X-Status: make stream_d f77 -O -mips2 stream_d.f -o stream_d /usr/bin/timex stream_d -------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Timing calibration ; time = 94.99999694526196 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 55.1724 0.3442 0.2900 0.5200 Scaling : 53.3333 0.3219 0.3000 0.3800 Summing : 52.1740 0.5025 0.4600 0.5800 SAXPYing : 53.3334 0.4926 0.4500 0.5700 real 22.82 user 15.22 sys 2.31 From news.udel.edu!udel!wupost!howland.reston.ans.net!pipex!uknet!pavo.csi.cam.ac.uk!cast0.ast.cam.ac.uk!drtr Fri Oct 8 10:35:29 EDT 1993 Article: 43061 of comp.arch Newsgroups: comp.arch Path: news.udel.edu!udel!wupost!howland.reston.ans.net!pipex!uknet!pavo.csi.cam.ac.uk!cast0.ast.cam.ac.uk!drtr From: drtr@mail.ast.cam.ac.uk (David Robinson) Subject: Re: SPARCstation memory performance??? Message-ID: <1993Oct8.094903.4366@infodev.cam.ac.uk> Sender: news@infodev.cam.ac.uk (USENET news) Nntp-Posting-Host: coral.ast.cam.ac.uk Organization: Institute of Astronomy, Cambridge References: Date: Fri, 8 Oct 1993 09:49:03 GMT Lines: 51 Status: RO In article , jhoe@au-bon-pain.lcs.mit.edu (James C. Hoe) writes: |> |> Has anyone done any experiment or have detail knowledge of memory |> performance on SPARC workstations (4/4*0, SS10). I am particularly |> interested in load and store on cache misses. |> |> I have done some experiments and found my 40MHz SPARCstation2 uses |> nearly 30 cycles on a load misses (I was told this was due to |> very slow memory translation). SPARCstation2 also require 8 cycles |> to store regardless of miss or hit. 
|>
|> Have anyone done similar experiments, or know detail inner workings of
|> these workstations, especially SS10's. I would appreciate any
|> correspondence.

For a SS10 with MCC and secondary cache (10/41, 10/51 etc) the penalties are

  Load miss primary cache, hit secondary cache:   5 cycles
  Load miss primary cache, miss secondary cache:  ~80 cycles

For a SS10 without MCC, specifically 10/40, the penalty is

  Load miss primary cache:  ~15 cycles

I suspect the penalty will be less for a 10/30, as cycles are longer.

Store penalties are harder to measure, and less useful to know, as there is a four level store buffer. However, I have measured the peak store bandwidth:

Timings done using the std instruction. All bandwidths in MB/s
I = on-chip (internal) or primary cache, E = off-chip (external) or secondary cache
Clock speed given either as CPU speed or CPU speed/memory bus speed

          Hit I cache   Miss I cache   Miss I cache
                        Hit E cache    Miss E cache
10/40         308            -              46        40MHz, I=16KB, E=0KB
10/51         220           212             34        50MHz/40MHz, I=16KB, E=1MB
10/41         182           172             31        40.3MHz/40MHz, I=16KB, E=1MB
Classic        62            -              64        50MHz, I=2KB, E=0KB
SS2             -            25             24        40MHz, I=0KB, E=64KB

It is obvious that the 10/41 & 10/51 have a write-through internal cache, whereas the 10/40 has a write-back internal cache. For a 10/40, a store that hits the internal cache can be issued every cycle. For a 10/41 or 51, once the store buffer is filled, stores can be executed every 1 3/4 cycles, on average. For the SS2, 24MB/s translates to 13.0 cycles for each std, or 6.5 cycles per word, so either storing is pipelined, or 8 cycles only applies to single word stores.

David Robinson.
(drtr@mail.ast.cam.ac.uk) From rothberg@SSD.intel.com Ukn Oct 14 16:58:35 1993 Received: from brahms.udel.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI) for mccalpin id AA24426; Thu, 14 Oct 93 16:58:33 -0400 Return-Path: Received: from SSD.intel.com (ssd.intel.com [137.46.201.30]) by brahms.udel.edu (8.6.beta.11/8.6.beta.2) with SMTP id QAA10316 for ; Thu, 14 Oct 1993 16:56:30 -0400 From: rothberg@SSD.intel.com Received: from warthog.ssd.intel.com by SSD.intel.com (4.1/SMI-4.1) id AA07184; Thu, 14 Oct 93 13:56:28 PDT Message-Id: <9310142056.AA07184@SSD.intel.com> To: mccalpin@brahms.udel.edu Subject: STREAM benchmark Date: Thu, 14 Oct 93 13:56:27 -0700 Status: RO X-Status: I saw the table you posted a few weeks ago in comp.arch, comparing achieved memory bandwidths for several machines. Very interesting numbers. I was wondering whether you've got any data on the new IBM POWER2 machines. Also, is there any chance I could get a copy of the codes you used? I'd like to try running the test on a few other machines. I realize that they are all just 3-line programs, but it makes direct comparison easier if I know that I'm using the same source. 
Thanks, Ed Rothberg Intel Supercomputer Systems Division rothberg@ssd.intel.com From Renu.Raman@Eng.Sun.COM Tue Oct 26 20:54:12 1993 Received: from bach.udel.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI) for mccalpin id AA04709; Tue, 26 Oct 93 20:54:11 -0400 Return-Path: Received: from Sun.COM (Sun.COM [192.9.9.1]) by bach.udel.edu (8.6.beta.11/8.6.beta.2) with SMTP id UAA04876 for ; Tue, 26 Oct 1993 20:52:44 -0400 Received: from Eng.Sun.COM (zigzag.Eng.Sun.COM) by Sun.COM (4.1/SMI-4.1) id AA09136; Tue, 26 Oct 93 17:52:27 PDT Received: from shukra.Eng.Sun.COM by Eng.Sun.COM (4.1/SMI-4.1) id AA10081; Tue, 26 Oct 93 17:52:12 PDT Received: by shukra.Eng.Sun.COM (4.1/SMI-4.1) id AA05341; Tue, 26 Oct 93 17:52:25 PDT Date: Tue, 26 Oct 93 17:52:25 PDT From: Renu.Raman@Eng.Sun.COM (Renu Raman) Message-Id: <9310270052.AA05341@shukra.Eng.Sun.COM> To: mccalpin@bach.udel.edu Subject: Re: IBM RS/6000 or HP Apollo 9000: which to buy? Newsgroups: comp.benchmarks In-Reply-To: References: <2a7utr$lf5@nh1.u-aizu.ac.jp> Organization: Sun Cc: geomagic@seismo.do.usbr.gov Status: RO

Update of SS10/41

>                Bytes          Bandwidth (MB/s)
>Machine         /word    Copy     Scale     Sum      Triad
>--------------- -----  -------- -------- -------- --------
>Sun SS10/41       4      34.3     38.4     36.9     37.9
>Sun SS10/30       8      42.1     46.2     46.2     46.2
>Sun SS10/30       4      33.9     41.7     39.0     34.1
 Sun SS10/41       8      48.0     48.0     54.0     54.0
 Sun SS10/512      8      48.0     48.0     48.0     48.0

Obviously, as the memory system does not scale, it's about the same.... These numbers were obtained using the Apogee compilers...
renu raman From hahn@neurocog.lrdc.pitt.edu Ukn Oct 27 12:03:18 1993 Received: from neurocog.lrdc.pitt.edu by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI) for mccalpin id AA05504; Wed, 27 Oct 93 12:03:16 -0400 Return-Path: Message-Id: <9310271603.AA05504@perelandra.cms.udel.edu> Received: by neurocog.lrdc.pitt.edu (1.37.109.4/16.2) id AA29337; Wed, 27 Oct 93 12:01:59 -0400 From: Mark Hahn Subject: stream in C To: mccalpin Date: Wed, 27 Oct 1993 12:01:58 -0500 (EDT) Cc: hahn@neurocog.lrdc.pitt.edu (Mark Hahn, lrdc 512, 6247063, 3633618) X-Mailer: ELM [version 2.4 PL21] Content-Type: text Content-Length: 4344 Status: RO X-Status: I had no luck finding the correct time/clock function for the little-used fortran on this machine, an hp 735, so I translated your stream.f into reasonably portable C. I'd appreciate it if you would look it over and verify that it's doing something comparable to the fortran version. my intent was to post the c version on comp.benchmarks. BTW, with this code, our 735 gets about 73 mb/s double saxpy. /* * Program: Stream * Programmer: John D. McCalpin * Revision: 2.0, September 30,1991 * * This program measures memory transfer rates in MB/s for simple * computational kernels coded in Fortran. These numbers reveal the * quality of code generation for simple uncacheable kernels as well * as showing the cost of floating-point operations relative to memory * accesses. * * INSTRUCTIONS: * 1) (fortran-specific, omitted.) * 2) Stream requires a good bit of memory to run. * Adjust the Parameter 'N' in the second line of the main * program to give a 'timing calibration' of at least 20 clicks. * This will provide rate estimates that should be good to * about 5% precision. * 3) Compile the code with full optimization. Many compilers * generate unreasonably bad code before the optimizer tightens * things up. If the results are unreasonable good, on the * other hand, the optimizer might be too smart for me! 
* 4) Mail the results to mccalpin@perelandra.cms.udel.edu * Be sure to include: * a) computer hardware model number and software revision * b) the compiler flags * c) all of the output from the test case. * Thanks! * * this version was ported from fortran to c by mark hahn, hahn+@pitt.edu. */ #define N 1000000 #define NTIMES 10 #ifdef __hpux #define _HPUX_SOURCE 1 #else #define _INCLUDE_POSIX_SOURCE 1 #endif #include #include #include #include #ifndef MIN #define MIN(x,y) ((x)<(y)?(x):(y)) #endif #ifndef MAX #define MAX(x,y) ((x)>(y)?(x):(y)) #endif struct timeval tvStart; void utimeStart() { struct timezone tz; gettimeofday(&tvStart,&tz); } float utime() { struct timeval tv; struct timezone tz; float utime; gettimeofday(&tv,&tz); utime = 1e6 * (tv.tv_sec - tvStart.tv_sec) + tv.tv_usec - tvStart.tv_usec; if (tv.tv_usec < tvStart.tv_usec) utime += 1e6; return utime; } typedef double real; static real a[N],b[N],c[N]; int main() { int j,k; float times[4][NTIMES]; static float rmstime[4] = {0}; static float mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX}; static float maxtime[4] = {0}; static char *label[4] = {"Assignment:", "Scaling :", "Summing :", "SAXPYing :"}; static float bytes[4] = { 2 * sizeof(real) * N, 2 * sizeof(real) * N, 3 * sizeof(real) * N, 3 * sizeof(real) * N}; /* --- SETUP --- determine precision and check timing --- */ utimeStart(); for (j=0; j Message-Id: <9310271635.AA05615@perelandra.cms.udel.edu> Received: by neurocog.lrdc.pitt.edu (1.37.109.4/16.2) id AA29898; Wed, 27 Oct 93 12:34:05 -0400 From: Mark Hahn Subject: Re: stream in C To: mccalpin (John D. McCalpin) Date: Wed, 27 Oct 1993 12:34:05 -0500 (EDT) In-Reply-To: <9310271627.AA05590@perelandra.cms.udel.edu> from "John D. McCalpin" at Oct 27, 93 12:27:59 pm X-Mailer: ELM [version 2.4 PL21] Content-Type: text Content-Length: 635 Status: RO X-Status: > There is one significant difference -- your code measures the elapsed > time, while the Fortran version measures the cpu time. 
If your machine > is almost idle, then the *minimum* elapsed time is not a bad measure > of the cpu time, otherwise it is nearly impossible to compare. > > I thought that HP had an etime(dummy) function available from Fortran. > Did you look for that one? > > Alternatively, you can use what most folks use, provided that you can > figure out how to link C and Fortran.... good comments, I'll try all three. thanks, mark hahn. -- this space intentionally left non-blank. hahn@neurocog.lrdc.pitt.edu From Renu.Raman@Eng.Sun.COM Ukn Oct 27 14:34:43 1993 Received: from Sun.COM by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI) for mccalpin id AA06233; Wed, 27 Oct 93 14:34:26 -0400 Return-Path: Received: from Eng.Sun.COM (zigzag.Eng.Sun.COM) by Sun.COM (4.1/SMI-4.1) id AA21857; Wed, 27 Oct 93 11:32:15 PDT Received: from shukra.Eng.Sun.COM by Eng.Sun.COM (4.1/SMI-4.1) id AA21259; Wed, 27 Oct 93 11:31:16 PDT Received: by shukra.Eng.Sun.COM (4.1/SMI-4.1) id AA06850; Wed, 27 Oct 93 11:31:33 PDT Date: Wed, 27 Oct 93 11:31:33 PDT From: Renu.Raman@Eng.Sun.COM (Renu Raman) Message-Id: <9310271831.AA06850@shukra.Eng.Sun.COM> To: mccalpin Subject: Re: IBM RS/6000 or HP Apollo 9000: which to buy? Status: RO X-Status: >From mccalpin@perelandra.cms.udel.edu Wed Oct 27 04:50:39 1993 48.0 > >Could you please send the raw output of the tests? >I am trying to keep very complete records for all the entries... > >Thanks! 
Here are the details SparcClassic (uSPARC 50MhZ) -------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Timing calibration ; time = 101.6666617244482 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 57.6001 0.0937 0.0833 0.1000 Scaling : 48.0000 0.1122 0.1000 0.1333 Summing : 48.0000 0.1652 0.1500 0.1833 SAXPYing : 43.2000 0.1852 0.1667 0.2000 ******************************* SS10/41 Without E$ and 128MB of memory -------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Timing calibration ; time = 64.99999836087227 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 48.0000 0.1087 0.1000 0.1167 Scaling : 48.0000 0.1070 0.1000 0.1167 Summing : 54.0001 0.1402 0.1333 0.1500 SAXPYing : 54.0001 0.1385 0.1333 0.1500 *************************************************** SS10/512 - dual Vikings@50MHZ with E$ (1MB) -------------------------------------- Double precision appears to have 16 digits of accuracy Assuming 8 bytes per DOUBLEPRECISION word -------------------------------------- Timing calibration ; time = 24.00000095367432 hundredths of a second Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second --------------------------------------------------- Function Rate (MB/s) RMS time Min time Max time Assignment: 48.0000 0.1031 0.1000 0.1100 Scaling : 48.0000 0.1051 0.1000 0.1100 Summing : 48.0000 0.1561 0.1500 0.1600 SAXPYing : 48.0000 0.1561 0.1500 0.1600 renu 
raman

From bt@irfu.se Ukn Oct 27 18:01:29 1993
Received: from irfu.irfu.se by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI) for mccalpin id AA06568; Wed, 27 Oct 93 18:01:22 -0400
Return-Path:
Received: from abba.irfu.se by irfu.irfu.se with SMTP (16.6/15.6) id AA04121; Wed, 27 Oct 93 22:59:25 +0100
Received: by abba.irfu.se (1.37.109.4/15.6) id AA09035; Wed, 27 Oct 93 22:58:31 +0100
From: Bo Thide'
Message-Id: <9310272158.AA09035@abba.irfu.se>
Subject: Streams rsults for HP9000/720 and 735
To: mccalpin
Date: Wed, 27 Oct 93 22:58:31 MET
Mailer: Elm [revision: 70.85]
Status: RO
X-Status:

Hi John,

Just wanted to send you the results I obtained for stream on some
of our HP's:

HP9000/720:
--------------------------------------
Single precision appears to have 7 digits of accuracy
Assuming 4 bytes per default REAL word
--------------------------------------
Timing calibration ; time = 22.99999 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)   RMS time   Min time   Max time
Assignment:    45.7143       .0762      .0700      .0800
Scaling   :    53.3334       .0662      .0600      .0700
Summing   :    48.0000       .1090      .1000      .1100
SAXPYing  :    53.3334       .0921      .0900      .1000

--------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
Timing calibration ; time = 34.00000147521495 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)   RMS time   Min time   Max time
Assignment:    53.3334       .1011      .0900      .1100
Scaling   :    53.3334       .0991      .0900      .1100
Summing   :    55.3847       .1361      .1300      .1400
SAXPYing  :    55.3847       .1381      .1300      .1400

HP9000/735:
--------------------------------------
Single precision appears to have 7 digits of accuracy
Assuming 4 bytes per default REAL word
--------------------------------------
Timing calibration ; time = 14.0 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)   RMS time   Min time   Max time
Assignment:    80.0000       .0493      .0400      .0600
Scaling   :   106.6668       .0486      .0300      .0600
Summing   :    80.0001       .0663      .0600      .0800
SAXPYing  :    95.9999       .0687      .0500      .0800

--------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
Timing calibration ; time = 22.00000043958425 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)   RMS time   Min time   Max time
Assignment:    68.5715       .0752      .0700      .0800
Scaling   :    80.0001       .0743      .0600      .0800
Summing   :    80.0001       .0993      .0900      .1100
SAXPYing  :    90.0001       .0972      .0800      .1000

I leave it to you to calculate the MFLOPS from these values.

BTW, I used HP-UX 9.01 f77 with less than full optimization. With max
optimization, dead-code elimination gave ridiculous results (infinite
speed).

Bo

From hahn@neurocog.lrdc.pitt.edu Thu Oct 28 08:02:01 EDT 1993
Article: 12912 of comp.benchmarks
Path: news.udel.edu!darwin.sura.net!math.ohio-state.edu!cs.utexas.edu!uunet!pitt.edu!neurocog.lrdc.pitt.edu!hahn
From: hahn@neurocog.lrdc.pitt.edu (Mark Hahn)
Newsgroups: comp.benchmarks
Subject: Re: IBM RS/6000 or HP Apollo 9000: which to buy?
Message-ID: <5316@blue.cis.pitt.edu>
Date: 28 Oct 93 02:21:05 GMT
References: <2a7utr$lf5@nh1.u-aizu.ac.jp> <5190@blue.cis.pitt.edu>
Sender: news+@pitt.edu
Lines: 160
X-Newsreader: TIN [version 1.2 PL2]
Status: RO

appended to this message is a fairly portable C translation of stream.f.
on our hp735 and "cc +P +O3 -J +Om1 -Wl,-a,archive", I get these results:

Timing calibration ; time = 760.00 usec.
Increase the size of the arrays if this is < 300
 and your clock precision is =< 1/100 second.
---------------------------------------------------
Function     Rate (MB/s)   RMS time   Min time   Max time
Assignment:     69.837     247.083    240.000    260.000
Scaling   :     69.837     246.049    240.000    250.000
Summing   :     71.832     351.013    350.000    360.000
SAXPYing  :     73.945     350.143    340.000    370.000

The code is also available for anon ftp from
neurocog.lrdc.pitt.edu:pub/cstream.c

/*
 * Program: Stream
 * Programmer: John D. McCalpin
 * Revision: 2.0, September 30,1991
 *
 * This program measures memory transfer rates in MB/s for simple
 * computational kernels coded in Fortran. These numbers reveal the
 * quality of code generation for simple uncacheable kernels as well
 * as showing the cost of floating-point operations relative to memory
 * accesses.
 *
 * INSTRUCTIONS:
 * 1) (fortran-specific, omitted.)
 * 2) Stream requires a good bit of memory to run.
 *    Adjust the Parameter 'N' in the second line of the main
 *    program to give a 'timing calibration' of at least 20 clicks.
 *    This will provide rate estimates that should be good to
 *    about 5% precision.
 * 3) Compile the code with full optimization. Many compilers
 *    generate unreasonably bad code before the optimizer tightens
 *    things up. If the results are unreasonably good, on the
 *    other hand, the optimizer might be too smart for me!
 * 4) Mail the results to mccalpin@perelandra.cms.udel.edu
 *    Be sure to include:
 *    a) computer hardware model number and software revision
 *    b) the compiler flags
 *    c) all of the output from the test case.
 *
 * Thanks!
 *
 * This version was ported from the fortran by Mark Hahn, hahn+@pitt.edu.
 */

#define N (1023*1024)
#define NTIMES 10

#define _HPUX_SOURCE 1
#define _POSIX_SOURCE 1
#define _XOPEN_SOURCE 1
#define _INCLUDE_POSIX_SOURCE 1

#include
#include
#include
#include
#include

#ifndef MIN
#define MIN(x,y) ((x)<(y)?(x):(y))
#define MAX(x,y) ((x)>(y)?(x):(y))
#endif

struct tms tmsStart;

void mtimeStart()
{
    times(&tmsStart);
}

float mtime()
{
    struct tms t;
    times(&t);
    return 1e3 * (float) ((t.tms_stime - tmsStart.tms_stime)
                          + (t.tms_utime - tmsStart.tms_utime))
           / (float) CLK_TCK;
}

typedef double real;
static real a[N],b[N],c[N];

int main()
{
    int j,k;
    float times[4][NTIMES];
    static float rmstime[4] = {0};
    static float mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX};
    static float maxtime[4] = {0};
    static char *label[4] = {"Assignment:",
                             "Scaling :",
                             "Summing :",
                             "SAXPYing :"};
    static float bytes[4] = { 2 * sizeof(real) * N,
                              2 * sizeof(real) * N,
                              3 * sizeof(real) * N,
                              3 * sizeof(real) * N};

    /* --- SETUP --- determine precision and check timing --- */
    mtimeStart();
    for (j=0; j

Received: from hybrid.irfu.se by irfu.irfu.se with SMTP (16.6/15.6) id AA29666; Thu, 28 Oct 93 18:50:39 +0100
Received: by hybrid.irfu.se (1.37.109.4/16.2) id AA00268; Thu, 28 Oct 93 18:51:00 +0100
From: Bo Thide'
Message-Id: <9310281751.AA00268@hybrid.irfu.se>
Subject: Re: Streams rsults for HP9000/720 and 735
To: mccalpin (John D. McCalpin)
Date: Thu, 28 Oct 1993 18:51:00 +0100 (MET)
In-Reply-To: from "John D. McCalpin" at Oct 28, 93 10:09:53 am
Reply-To: bt@irfu.se
Organization: Swedish Institute of Space Physics, S-755 91 Uppsala, Sweden
System: HP-UX A.09.01 9000/720
X-Mailer: ELM [version 2.4 PL23]
Content-Type: text
Content-Length: 874
Status: RO
X-Status:

You (John D. McCalpin) write:
>
> Thanks for the results....
>
> It looks like the runs were too short for accurate measurements,
> assuming that the clock is .01 second resolution. Several of the later
> tests are only 3-4 clock ticks. They need to be about 10 times longer to
> get accurate measurements....

But how can I do that? I have already upped the n (= array size) to the
limit my present kernel can accept. Is it acceptable to instead increase
ntimes from 10 to 100?

Bo
--
   ^   Bo Thide'---------------------------------------------Science Director-
  |I|  Swedish Institute of Space Physics, S-755 91 Uppsala, Sweden
  |R|  Phone: (+46) 18-303671. Fax: (+46) 18-403100. IP: 130.238.30.23
 /|F|\ INTERNET: bt@irfu.se UUCP: ...!mcvax!sunic!irfu!bt
 ~~U~~ ----------------------------------------------------------------sm5dfw-

From bt@irfu.se Ukn Oct 28 14:14:13 1993
Received: from irfu.irfu.se by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI) for mccalpin id AA09436; Thu, 28 Oct 93 14:14:10 -0400
Return-Path:
Received: from hybrid.irfu.se by irfu.irfu.se with SMTP (16.6/15.6) id AA29892; Thu, 28 Oct 93 19:12:19 +0100
Received: by hybrid.irfu.se (1.37.109.4/16.2) id AA00493; Thu, 28 Oct 93 19:12:40 +0100
From: Bo Thide'
Message-Id: <9310281812.AA00493@hybrid.irfu.se>
Subject: Re: Streams rsults for HP9000/720 and 735
To: mccalpin (John D. McCalpin)
Date: Thu, 28 Oct 1993 19:12:39 +0100 (MET)
In-Reply-To: from "John D. McCalpin" at Oct 28, 93 10:09:53 am
Reply-To: bt@irfu.se
Organization: Swedish Institute of Space Physics, S-755 91 Uppsala, Sweden
System: HP-UX A.09.01 9000/720
X-Mailer: ELM [version 2.4 PL23]
Content-Type: text
Content-Length: 588
Status: RO
X-Status:

Hi again,

I have had a closer look at the code and see that one must either
increase n or use another timing technique. I now see that the ntimes
parameter does not affect the clock granularity.

Bo
--
   ^   Bo Thide'---------------------------------------------Science Director-
  |I|  Swedish Institute of Space Physics, S-755 91 Uppsala, Sweden
  |R|  Phone: (+46) 18-303671. Fax: (+46) 18-403100. IP: 130.238.30.23
 /|F|\ INTERNET: bt@irfu.se UUCP: ...!mcvax!sunic!irfu!bt
 ~~U~~ ----------------------------------------------------------------sm5dfw-

From bt@irfu.se Thu Oct 28 16:44:37 1993
Received: from irfu.irfu.se by perelandra.cms.udel.edu via SMTP (911016.SGI/911001.SGI) for mccalpin id AA10285; Thu, 28 Oct 93 16:44:33 -0400
Return-Path:
Received: from hybrid.irfu.se by irfu.irfu.se with SMTP (16.6/15.6) id AA01167; Thu, 28 Oct 93 21:42:41 +0100
Received: by hybrid.irfu.se (1.37.109.4/16.2) id AA01831; Thu, 28 Oct 93 21:43:02 +0100
From: Bo Thide'
Message-Id: <9310282043.AA01831@hybrid.irfu.se>
Subject: Re: Streams rsults for HP9000/720 and 735
To: mccalpin (John D. McCalpin)
Date: Thu, 28 Oct 1993 21:43:02 +0100 (MET)
In-Reply-To: from "John D. McCalpin" at Oct 28, 93 03:50:04 pm
Reply-To: bt@irfu.se
Organization: Swedish Institute of Space Physics, S-755 91 Uppsala, Sweden
System: HP-UX A.09.01 9000/720
X-Mailer: ELM [version 2.4 PL23]
Content-Type: text
Content-Length: 2621
Status: RO

You (John D. McCalpin) write:
>
> On Thu, 28 Oct 1993, Bo Thide' wrote:
>
> > I have had a closer look at the code and see that one must either
> > increase n or use another timing technique. I now see that the ntimes
> > parameter does not affect the clock granularity.
>
> It does not affect the timing directly, but it does allow averaging
> over more ticks. I am not convinced that the answers are the same as for
> timing the larger segment separately, but it is the best that can be done
> if memory is limited.
>

I just rebuilt my 720 kernel with a four times larger stack size and
increased the n parameter a bit.
Here are the new HP9000/720 results:

--------------------------------------
Single precision appears to have 7 digits of accuracy
Assuming 4 bytes per default REAL word
--------------------------------------
Timing calibration ; time = 118.0 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)   RMS time   Min time   Max time
Assignment:    44.4446       .3650      .3600      .3700
Scaling   :    47.0589       .3400      .3400      .3400
Summing   :    48.0001       .5020      .5000      .5100
SAXPYing  :    52.1739       .4650      .4600      .4700

--------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLEPRECISION word
--------------------------------------
Timing calibration ; time = 114.0000076964497 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)   RMS time   Min time   Max time
Assignment:    48.4849       .3330      .3300      .3400
Scaling   :    50.0000       .3260      .3200      .3300
Summing   :    50.0000       .4810      .4800      .4900
SAXPYing  :    53.3335       .4560      .4500      .4600

As I see it, the best way would be to implement a timing routine based
on gettimeofday. On the HP9000/700 this timer has a 1 microsec
resolution. The etime timer has a 10 millisec resolution.

Bo
--
   ^   Bo Thide'---------------------------------------------Science Director-
  |I|  Swedish Institute of Space Physics, S-755 91 Uppsala, Sweden
  |R|  Phone: (+46) 18-303671. Fax: (+46) 18-403100.
IP: 130.238.30.23
 /|F|\ INTERNET: bt@irfu.se UUCP: ...!mcvax!sunic!irfu!bt
 ~~U~~ ----------------------------------------------------------------sm5dfw-

Received: from localhost.desy.de by rec06.desy.de via SMTP (920330.SGI/920502.SGI.AUTO) for mccalpin@perelandra.cms.udel.edu id AA10232; Thu, 4 Nov 93 18:34:39 +0100
Message-Id: <9311041734.AA10232@rec06.desy.de>
To: mccalpin
Subject: stream result
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Thu, 04 Nov 1993 18:34:38 +0100
From: Karsten Kuenne
Status: RO
X-Status:

Hi John,

this is what I get on a SGI Challenge ( 1 cpu ).

zarah2:~/benchmarks-> uname -a                                          18:25
IRIX zarah2 5.1.1.1 10011612 IP19 mips
zarah2:~/benchmarks-> hinv                                              18:25
18 150 MHZ IP19 Processors
CPU: MIPS R4400 Processor Chip Revision: 5.0
FPU: MIPS R4010 Floating Point Chip Revision: 0.0
Data cache size: 16 Kbytes
Instruction cache size: 16 Kbytes
Secondary unified instruction/data cache size: 1 Mbyte
Main memory size: 512 Mbytes, 4-way interleaved
I/O board, Ebus slot 15: IO4 revision 1
Integral IO4 serial ports: 4
Integral Ethernet controller: et0, Ebus slot 15
Integral SCSI controller 1: Version WD33C95A
Disk drive: unit 8 on SCSI controller 1
Disk drive: unit 4 on SCSI controller 1
Disk drive: unit 3 on SCSI controller 1
Disk drive: unit 2 on SCSI controller 1
Disk drive: unit 1 on SCSI controller 1
Integral SCSI controller 0: Version WD33C95A
Disk drive: unit 5 on SCSI controller 0
Disk drive: unit 4 on SCSI controller 0
Disk drive: unit 1 on SCSI controller 0
Integral Ethernet: ec0, version 1
Integral IO4 parallel port: Ebus slot 15
VME bus: adapter 0 mapped to adapter 61
VME bus: adapter 61

Compiler options: f77 -ddopt -mips2 -O4 -non_shared -o stream stream.f

zarah2:~/benchmarks-> ./stream                                          18:27
--------------------------------------
Single precision appears to have 7 digits of accuracy
Assuming 4 bytes per default REAL word
--------------------------------------
Timing calibration ; time = 42.00000 hundredths of a second
Increase the size of the arrays if this is <30
 and your clock precision is =<1/100 second
---------------------------------------------------
Function     Rate (MB/s)   RMS time   Min time   Max time
Assignment:    66.6667      0.1271     0.1200     0.1300
Scaling   :    66.6667      0.1291     0.1200     0.1400
Summing   :    66.6667      0.1921     0.1800     0.2000
SAXPYing  :    63.1579      0.1941     0.1900     0.2000

And this is for cstream:

Compiler options: cc -sopt -O4 -mips2 -non_shared -o cstream cstream.c -lfastm

zarah2:~/benchmarks-> ./cstream                                         18:31
Timing calibration ; time = 980.00 usec.
Increase the size of the arrays if this is < 300
 and your clock precision is =< 1/100 second.
---------------------------------------------------
Function     Rate (MB/s)   RMS time   Min time   Max time
Assignment:     69.837     245.051    240.000    250.000
Scaling   :     69.837     250.080    240.000    260.000
Summing   :     69.837     368.076    360.000    380.000
SAXPYing  :     71.832     361.040    350.000    370.000

Best regards,

Karsten Kuenne.

--
////////////////////////////////////////////////////////////////////
Karsten Kuenne, DESY (-R2-), Notkestr. 85, 22607 Hamburg, Germany
phone: +49-40-8998-3315 fax: +49-40-8994-4429
e-mail: , ,
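
Bo Thide' suggests above that the timing routine be based on gettimeofday rather than a clock with 10 millisecond ticks. A minimal sketch of what that might look like in C follows. The names wall_seconds, rate_mb_per_s and time_assignment are illustrative assumptions and do not appear in stream.f or cstream.c. Note that gettimeofday reports elapsed wall-clock time, whereas the times() call in cstream.c counts user plus system CPU time, so this variant is only meaningful on an otherwise idle machine. MB is taken here as 10^6 bytes.

```c
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in seconds, read from gettimeofday().
   Hypothetical helper; not part of stream.f or cstream.c. */
double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double) tv.tv_sec + 1.0e-6 * (double) tv.tv_usec;
}

/* Transfer rate in MB/s, assuming 1 MB = 1e6 bytes,
   for `bytes` moved in `seconds`. */
double rate_mb_per_s(double bytes, double seconds)
{
    return bytes / seconds / 1.0e6;
}

/* Time one pass of the assignment kernel a[j] = b[j] over n doubles
   and return the elapsed wall-clock seconds. */
double time_assignment(double *a, const double *b, size_t n)
{
    size_t j;
    double t0 = wall_seconds();
    for (j = 0; j < n; j++)
        a[j] = b[j];
    return wall_seconds() - t0;
}
```

As a check on the rate arithmetic: the assignment kernel in cstream.c moves 2 * sizeof(double) * N = 16760832 bytes for N = 1023*1024. Taking the 240.000 minimum time in Mark Hahn's hp735 output as milliseconds (which is what the 1e3 factor in mtime() produces), rate_mb_per_s(16760832.0, 0.240) comes to about 69.84, in line with the 69.837 MB/s he reports.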