Stream benchmark

From: Robert Bell (Robert.Bell@mel.dit.csiro.au)
Date: Mon Aug 01 1994 - 00:40:48 CDT


John,
     I have been following your stream benchmark, and your advocacy of
the importance of memory speeds. I have some of my own tests which
look particularly at strides, but I have not had time to develop them
much.
     I have recently run your test on a Cray Y-MP 4E/464, under UNICOS
7.0.5.3, and found that the results were not as expected. (I implemented
the timing code by removing 'second' from the external declaration, and
supplying no external routine, so the Cray intrinsic was used.)
     Here are the results with the following compilation:

cf77 -V -O vector3 -o stream stream.orig.f
Cray CF77 Version 6.0 (6.55) 08/01/94 15:36:00
Cray FPP Version 6.0 (3.06F1) 08/01/94 15:36:00
 cft77-42 cf77: Cray CFT77 Version 6.0.3.9 (008708) 08/01/94 15:36:01
 cft77-2 cf77: COMPILE TIME .669 SECONDS
 cft77-6 cf77: MAXIMUM FIELD LENGTH 421358 DECIMAL WORDS
 cft77-3 cf77: 236 SOURCE LINES
 cft77-4 cf77: 0 ERROR, 0 WARNING, 0 CAUTION, 0 COMMENT, 0 NOTE, 0 ANSI
 cft77-5 cf77: CODE: 567 WORDS, DATA: 281 WORDS

--------------------------------------
 Single precision appears to have 14 digits of accuracy
 Assuming 8 bytes per default REAL word
--------------------------------------
 Timing calibration ; time = 0.1128762 hundredths of a second
 Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second
 ---------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:    **********     0.0000     0.0000     0.0000
Scaling   :    **********     0.0000     0.0000     0.0000
Summing   :    **********     0.0000     0.0000     0.0000
SAXPYing  :    **********     0.0000     0.0000     0.0000

 
It appears that the compiler was smart enough to optimise away some of
your code: the reported times are all zero, so the rates overflow their
output field. I changed some of the loops as follows:
      do 30 j=1,N
C Original.     c(j) = 3.0e0*a(j)
          b(j) = 3.0e0*c(j)
   30 continue

      do 40 j=1,N
C Original.     c(j) = a(j)+b(j)
          a(j) = b(j)+c(j)
   40 continue

The results were then

--------------------------------------
 Single precision appears to have 14 digits of accuracy
 Assuming 8 bytes per default REAL word
--------------------------------------
 Timing calibration ; time = 1.9692438 hundredths of a second
 Increase the size of the arrays if this is <30 and your clock precision is =<1/100 second
 ---------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:     2323.1070     0.0069     0.0069     0.0070
Scaling   :     2354.6871     0.0069     0.0068     0.0070
Summing   :     2365.9413     0.0105     0.0101     0.0107
SAXPYing  :     2365.2446     0.0105     0.0101     0.0108

These look correct for a 2-section Y-MP. (The results in your table
are for a 4-section Y-MP.)

Note that the setting of c can still be removed from the do 10 loop,
and that the values of c should be used after the do 50 loop, to
prevent optimisers from taking further liberties with your code.
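
As a rough illustration of using the values of c after the last loop,
something along these lines might do. This is only a sketch, not the
stream source: the program name, the array size n, the initial values
and the csum variable are all assumptions made for the example.

```fortran
      program keepc
C     Hypothetical sketch, not the actual stream source: n, the
C     initial values and csum are illustrative assumptions.
      integer n
      parameter (n=1000)
      double precision a(n), b(n), c(n), csum
      integer j
C     Initialise only a and b; c is first written in loop 50.
      do 10 j=1,n
          a(j) = 1.0d0
          b(j) = 2.0d0
   10 continue
C     Stand-in for the last timed loop (do 50 in the original).
      do 50 j=1,n
          c(j) = a(j) + 3.0d0*b(j)
   50 continue
C     Consume every element of c after the loop, so the stores
C     to c contribute to a value that is actually printed.
      csum = 0.0d0
      do 60 j=1,n
          csum = csum + c(j)
   60 continue
      print *, 'checksum of c = ', csum
      end
```

Printing the checksum outside the timed region gives the compiler a
reason to keep the stores to c without adding to the measured time.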

Also, it is good to avoid the first value returned by some of the
timing routines on some systems - the first value includes some setup
time.
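
A minimal sketch of discarding the first timing might look like the
following. It assumes the Fortran 95 cpu_time intrinsic as a portable
stand-in for the Cray 'second' intrinsic; the names ntimes, tmin and
the array size n are hypothetical and not taken from stream.

```fortran
      program skip1
C     Hypothetical sketch: cpu_time (Fortran 95) stands in for the
C     Cray 'second' intrinsic; n, ntimes and tmin are assumed names.
      integer n, ntimes
      parameter (n=100000, ntimes=10)
      double precision a(n), c(n)
      real t0, t1, t, tmin
      integer j, k
      do 10 j=1,n
          a(j) = 1.0d0
   10 continue
      tmin = huge(tmin)
      do 30 k=1,ntimes
          call cpu_time(t0)
          do 20 j=1,n
              c(j) = a(j)
   20     continue
          call cpu_time(t1)
          t = t1 - t0
C         Ignore trial 1: it may include timer setup cost.
          if (k .gt. 1) tmin = min(tmin, t)
   30 continue
C     Reference c after the timed loops, as suggested above.
      print *, 'min time (s) = ', tmin, '  c(n) = ', c(n)
      end
```

The same guard would apply to any other statistic accumulated over the
trials, not only the minimum.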

-- 
Regards
Rob. Bell		     (	email: csrcb@mel.dit.csiro.au  )
--
/ Robert.Bell@mel.dit.csiro.au |  CSIRO Supercomputing Facility Manager  \
| CSIRO Division of Information Technology, Supercomputing Support Group |
| 723 Swanston Street | tel: {+61 3,03,} 282 2620 or {+61,0} 18 108 333  |
\ Carlton VIC 3053 Australia   |  fax: {+61 3,03,} 282 2600              /


