stream benchmark

From: Robert Bell (Robert.Bell@mel.dit.csiro.au)
Date: Mon Feb 20 1995 - 23:38:02 CST


John,
        I tried the Cray Research f90 compiler on the stream benchmark
(non-parallel version). It seems the compiler was able to optimise
away all the loops.
        A new version of the main program follows, which the compiler
is not able to optimise away. I don't like this version - it is
messy, but the basic ideas are:
        - make all the values in an array distinct
        - use a somewhat arbitrary element of an array after it is defined
        - redefine a somewhat arbitrary element of an array before use
        - sum all elements of the final array after each iteration of
          the k=1,NTIMES loop
        - print out the checksum
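
The ideas in the list above can be sketched in isolation roughly as follows. This is only an illustrative free-form Fortran 90 fragment, not the posted program: the names thwart, idx and chk are hypothetical, only one kernel is shown, and n is kept small so that idx**2 stays well inside a 32-bit integer.

```fortran
! Sketch only: thwart, idx and chk are illustrative names, and n is
! deliberately small so that idx**2 cannot overflow a default integer.
program thwart
  implicit none
  integer, parameter :: n = 1000
  real :: a(n), c(n), chk
  integer :: idx, j

  ! Distinct values, so the arrays cannot be treated as constants.
  do j = 1, n
     a(j) = real(j - n/2) / real(n)
  end do

  idx = n
  chk = 0.0

  ! Kernel under test (in the benchmark this loop would be timed).
  do j = 1, n
     c(j) = a(j)
  end do

  ! Use a somewhat arbitrary element of the result ...
  idx = mod(idx**2, n) + 1
  chk = chk + c(idx)
  ! ... and redefine that element before the array is read again.
  c(idx) = chk

  ! Sum every element so that no store to c is dead, then print it.
  do j = 1, n
     chk = chk + c(j)
  end do
  print *, 'index = ', idx, ', checksum = ', chk
end program thwart
```

Because the printed checksum depends on every element of c, a compiler cannot discard the copy loop without changing the program's output.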

I think this code may give different checksums on machines with 32-bit
integers, but that won't matter.

The Fortran style needs fixing to be similar to yours - I've left in
some comments to note what changes I've made.

The code works under both f90 and cf77 on the Cray, although the
checksums differ at the fifth decimal place. The order of
summation may be crucial, since the sum of the array elements is
close to zero.
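
As a rough illustration of why the order of summation can matter here, the following free-form Fortran 90 sketch sums the same initial values forwards and backwards in default (single) precision. The program name sumorder and the array size are assumptions made for the example; whether the two results actually differ, and in which digit, depends on the compiler and the machine.

```fortran
! Sketch only: sumorder and n are illustrative choices, not taken
! from the posted program.
program sumorder
  implicit none
  integer, parameter :: n = 100000
  real :: a(n), fwd, bwd
  integer :: j

  ! Same initialisation as the benchmark: values spread around zero.
  do j = 1, n
     a(j) = real(j - n/2) / real(n)
  end do

  ! Forward sum: the large negative terms are accumulated first.
  fwd = 0.0
  do j = 1, n
     fwd = fwd + a(j)
  end do

  ! Backward sum: the large positive terms are accumulated first.
  bwd = 0.0
  do j = n, 1, -1
     bwd = bwd + a(j)
  end do

  print *, 'forward  sum = ', fwd
  print *, 'backward sum = ', bwd
end program sumorder
```

The positive and negative terms nearly cancel, so rounding errors picked up along the way are large relative to the small final result.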

It's yours!

Regards
Rob. Bell ( email: csrcb@mel.dit.csiro.au )

--
/ Robert.Bell@mel.dit.csiro.au |  CSIRO Supercomputing Facility Manager  \
| CSIRO Division of Information Technology, Supercomputing Support Group |
| 723 Swanston Street | tel: {+61 3,03,} 282 2620 or {+61,0} 18 108 333  |
\ Carlton VIC 3053 Australia   |  fax: {+61 3,03,} 282 2600              /
From 1995 May 8: | tel: {+61 3,03,} 9282 2620 | fax: {+61 3,03,} 9282 2600/

*  Program: Stream
*  Programmer: John D. McCalpin
*  Revision: 2.0, September 30,1991
*
*  This program measures memory transfer rates in MB/s for simple
*  computational kernels coded in Fortran.  These numbers reveal the
*  quality of code generation for simple uncacheable kernels as well
*  as showing the cost of floating-point operations relative to memory
*  accesses.
*
*  INSTRUCTIONS:
*     1) Stream requires a cpu timing function called second().
*        A sample is shown below.  This is unfortunately rather
*        system dependent.  It helps to know the granularity of the
*        timing.  The code below assumes that the granularity is
*        1/100 seconds.
*     2) Stream requires a good bit of memory to run.
*        Adjust the Parameter 'N' in the second line of the main
*        program to give a 'timing calibration' of at least 20 clicks.
*        This will provide rate estimates that should be good to
*        about 5% precision.
*     3) Compile the code with full optimization.  Many compilers
*        generate unreasonably bad code before the optimizer tightens
*        things up.  If the results are unreasonably good, on the
*        other hand, the optimizer might be too smart for me!
*     4) Mail the results to mccalpin@perelandra.cms.udel.edu
*        Be sure to include:
*           a) computer hardware model number and software revision
*           b) the compiler flags
*           c) all of the output from the test case.
*
*  Thanks!
*
      program Stream
      parameter (N = 1 000 000, NTIMES = 10)
      real a(N),b(N),c(N),times(4,NTIMES)
      real rmstime(4),mintime(4),maxtime(4)
      character*11 label(4)
      real second
      integer realsize,nbpw,bytes(4)
C Original.      external second,realsize
      external realsize
C
C     Extra variables to thwart optimisers which can throw away the
C     loops.  The Cray Research f90 compiler appeared to optimise
C     out all the loops before these were inserted.
C     A checksum is incremented after each loop, using an array
C     element determined by index.  An array element is replaced by
C     the checksum before each loop.
      integer index
      real sum
C
      data rmstime/4*0.0/,mintime/4*1.0e+36/,maxtime/4*0.0/
      data label/'Assignment:','Scaling :','Summing :',
     $   'SAXPYing :'/
      data bytes/2,2,3,3/

* --- SETUP --- determine precision and check timing ---

      nbpw = realsize()

      t = second(t0)
      do 10 j=1,N
C
C     With these original assignments, all values in any array were
C     the same, and no array operations are needed.
C        a(j) = 1.0
C        b(j) = 2.0
C        c(j) = 0.0
C
C     With the following, optimisation is more difficult.
         a(j) = real (j - N/2) / real (N)
C
C     Note that arrays b and c no longer need to be defined here.
   10 continue
      t = second(t0)-t
      print *,'Timing calibration ; time = ',t*100,' hundredths',
     $   ' of a second'
      print *,'Increase the size of the arrays if this is <30 ',
     $   ' and your clock precision is =<1/100 second'
      print *,'---------------------------------------------------'

      index = N
      sum = 0.0
* --- MAIN LOOP --- repeat test cases NTIMES times ---
      do 1000 k=1,NTIMES

         t = second(t0)
         do 20 j=1,N
            c(j) = a(j)
   20    continue
         t = second(t0)-t
         times(1,k) = t
         index = mod (index**2, N) + 1
         sum = sum + c(index)

         c(index) = sum
         t = second(t0)
         do 30 j=1,N
C Original.    c(j) = 3.0e0*a(j)
            b(j) = 3.0e0*c(j)
   30    continue
         t = second(t0)-t
         times(2,k) = t
         index = mod (index**2, N) + 1
         sum = sum + b(index)

         b(index) = sum
         t = second(t0)
         do 40 j=1,N
C Original.    c(j) = a(j)+b(j)
            a(j) = b(j)+c(j)
   40    continue
         t = second(t0)-t
         times(3,k) = t
         index = mod (index**2, N) + 1
         sum = sum + a(index)

         a(index) = sum
         t = second(t0)
         do 50 j=1,N
            c(j) = a(j)+3.0e0*b(j)
   50    continue
         t = second(t0)-t
         times(4,k) = t
         index = mod (index**2, N) + 1
         sum = sum + c(index)
C
C     Do a checksum each loop, to ensure each value of c is
C     defined.
         do 60 j=1,N
            sum = sum + c(j)
   60    continue
 1000 continue

* --- SUMMARY ---
      do 300 k=1,NTIMES
         do 200 j=1,4
            rmstime(j) = rmstime(j) + times(j,k)**2
            mintime(j) = min( mintime(j), times(j,k) )
            maxtime(j) = max( maxtime(j), times(j,k) )
  200    continue
  300 continue
      write (*,9000)
      do 320 j=1,4
         rmstime(j) = sqrt(rmstime(j)/float(NTIMES))
         write (*,9010) label(j),N*bytes(j)*nbpw/mintime(j)/1.0e6,
     $      rmstime(j),mintime(j),maxtime(j)
  320 continue
      Write (*,*)
      Write (*,*) ' Result check, using an index and checksum.'
      Write (*,*) ' For this test, n = ', n, ', ntimes = ', ntimes
      Write (*,*) ' index = ', index, ', sum = ', sum
      Write (*,*) ' For standard test, n = 1000000, ntimes = 10'
      Write (*,*) ' index = 152026, sum = -16664162642847.06'

 9000 format ('Function',5x,
     $   'Rate (MB/s) RMS time Min time Max time')
 9010 format (a,4(f10.4,2x))
      end


