Streams results for the CRAY SV1

From: Joe Glenski (glenski@sgi.com)
Date: Tue May 23 2000 - 11:32:08 CDT

    John-

    Shown below are the most recent "stream" results for the CRAY SV1.
    Please add them to your table.

    The system used has 32 SV1 Rev C CPUs and 32 Gbytes of type N memory.
    It was running UNICOS 10.0.

    The compiler used was cf90 3.4.0.0, with the following options:
          -dp -O scalar3,vector3,task3

    A copy of the program used and the summary result lines are
    also included below.

    Regards,

    -Joe Glenski.

    Benchmarking Group
    Cray Inc., 655F Lone Oak Drive, Eagan, MN 55121 USA
    glenski@cray.com

    ------------------------------------------------------------------------

    Number of CPUs: 2 to 32
    Array size: 4001 Kwords
    Array padding: 20248 Words

    STREAM Memory Bandwidth for Cray SV1 5/23/2000

    All results are in MB/s --- 1 MB=10^6 B, *not* 2^20 B

    -------------------------------------------------------------
    Machine ID  ncpus       COPY      SCALE        ADD      TRIAD
    -------------------------------------------------------------
    Cray_SV1       32  25350.616  25252.665  25325.902  24780.574
    Cray_SV1       16  24476.284  24760.294  24915.577  24425.151
    Cray_SV1       12  22093.827  21930.989  22710.514  22399.891
    Cray_SV1        8  18063.235  17783.525  18874.605  18615.010
    Cray_SV1        4   9373.209   9208.710   9859.369   9858.707
    Cray_SV1        2   4808.209   4677.492   4984.247   4991.767
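
One way to read the table is to divide each aggregate rate by the CPU count. The following is a minimal sketch in free-form Fortran, not part of the posted program; the module name, the function `per_cpu_rate` and the program name are made up for illustration, and the numbers are the TRIAD column copied from the table.

```fortran
module sv1_scaling_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Aggregate bandwidth divided by the CPU count, in MB/s per CPU.
  pure function per_cpu_rate(rate_mbs, ncpus) result(r)
    real(dp), intent(in) :: rate_mbs
    integer, intent(in)  :: ncpus
    real(dp) :: r
    r = rate_mbs / real(ncpus, dp)
  end function per_cpu_rate
end module sv1_scaling_mod

program sv1_scaling
  use sv1_scaling_mod
  implicit none
  integer, parameter :: nrows = 6
  integer  :: ncpus(nrows), i
  real(dp) :: triad(nrows), base

  ! CPU counts and TRIAD rates (MB/s) copied from the summary table.
  ncpus = (/ 2, 4, 8, 12, 16, 32 /)
  triad = (/ 4991.767_dp, 9858.707_dp, 18615.010_dp, &
             22399.891_dp, 24425.151_dp, 24780.574_dp /)

  ! Use the 2-CPU run as the reference point (hypothetical choice).
  base = per_cpu_rate(triad(1), ncpus(1))

  write (*, '(a)') ' ncpus   MB/s per CPU   relative to 2 CPUs'
  do i = 1, nrows
     write (*, '(i6, f15.1, f15.3)') ncpus(i), &
          per_cpu_rate(triad(i), ncpus(i)), &
          per_cpu_rate(triad(i), ncpus(i)) / base
  end do
end program sv1_scaling
```

With these figures, TRIAD bandwidth per CPU falls from about 2496 MB/s at 2 CPUs to about 774 MB/s at 32 CPUs, which suggests that the 16- and 32-CPU runs are limited by the shared memory system rather than by the processors.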

    -----------------------------------------------------------------------

     STREAM benchmark
     ----------------
     Number of CPUs: 2
     Array size: 4001 Kwords
     Array padding: 20248 Words

      Aoff Boff Coff  Assignment:Scaling:   Summing:   SAXPYing:
     -------------------------------------------------------------
        32   32   32     4808.209   4677.492   4984.247   4991.767
     STOP executed at line 146 in Fortran routine 'STREAM'
     CPU: 17.399s, Wallclock: 8.802s, 6.2% of 32-CPU Machine
     Memory HWM: 12835631, Stack HWM: 20696, Stack segment expansions: 0

     STREAM benchmark
     ----------------
     Number of CPUs: 4
     Array size: 4001 Kwords
     Array padding: 20248 Words

      Aoff Boff Coff  Assignment:Scaling:   Summing:   SAXPYing:
     -------------------------------------------------------------
        32   32   32     9373.209   9208.710   9859.369   9858.707
     STOP executed at line 146 in Fortran routine 'STREAM'
     CPU: 17.708s, Wallclock: 4.526s, 12.2% of 32-CPU Machine
     Memory HWM: 12835632, Stack HWM: 20696, Stack segment expansions: 0

     STREAM benchmark
     ----------------
     Number of CPUs: 8
     Array size: 4001 Kwords
     Array padding: 20248 Words

      Aoff Boff Coff  Assignment:Scaling:   Summing:   SAXPYing:
     -------------------------------------------------------------
        32   32   32    18063.235  17783.525  18874.605  18615.010
     STOP executed at line 146 in Fortran routine 'STREAM'
     CPU: 18.773s, Wallclock: 2.482s, 23.6% of 32-CPU Machine
     Memory HWM: 12835633, Stack HWM: 20696, Stack segment expansions: 0

     STREAM benchmark
     ----------------
     Number of CPUs: 12
     Array size: 4001 Kwords
     Array padding: 20248 Words

      Aoff Boff Coff  Assignment:Scaling:   Summing:   SAXPYing:
     -------------------------------------------------------------
        32   32   32    22093.827  21930.989  22710.514  22399.891
     STOP executed at line 146 in Fortran routine 'STREAM'
     CPU: 23.305s, Wallclock: 2.105s, 34.6% of 32-CPU Machine
     Memory HWM: 12845875, Stack HWM: 20696, Stack segment expansions: 0

     STREAM benchmark
     ----------------
     Number of CPUs: 16
     Array size: 4001 Kwords
     Array padding: 20248 Words

      Aoff Boff Coff  Assignment:Scaling:   Summing:   SAXPYing:
     -------------------------------------------------------------
        32   32   32    24476.284  24760.294  24915.577  24425.151
     STOP executed at line 146 in Fortran routine 'STREAM'
     CPU: 28.548s, Wallclock: 2.018s, 44.2% of 32-CPU Machine
     Memory HWM: 12866356, Stack HWM: 20696, Stack segment expansions: 0

     STREAM benchmark
     ----------------
     Number of CPUs: 32
     Array size: 4001 Kwords
     Array padding: 20248 Words

      Aoff Boff Coff  Assignment:Scaling:   Summing:   SAXPYing:
     -------------------------------------------------------------
        32   32   32    25350.616  25252.665  25325.902  24780.574
     STOP executed at line 146 in Fortran routine 'STREAM'
     CPU: 57.511s, Wallclock: 2.024s, 88.8% of 32-CPU Machine
     Memory HWM: 12938041, Stack HWM: 20696, Stack segment expansions: 0

    ------------------------------------------------------------------------

    * Program: Stream
    * Programmer: John D. McCalpin
    * Revision: 2.0, September 30,1991
    *
    * This program measures memory transfer rates in MB/s for simple
    * computational kernels coded in Fortran. These numbers reveal the
    * quality of code generation for simple uncacheable kernels as well
    * as showing the cost of floating-point operations relative to memory
    * accesses.
    *
    * INSTRUCTIONS:
    * 1) Stream requires a cpu timing function called second().
    * A sample is shown below. This is unfortunately rather
    * system dependent. It helps to know the granularity of the
    * timing. The code below assumes that the granularity is
    * 1/100 seconds.
    * 2) Stream requires a good bit of memory to run.
    * Adjust the Parameter 'N' in the second line of the main
    * program to give a 'timing calibration' of at least 20 clicks.
    * This will provide rate estimates that should be good to
    * about 5% precision.
    * 3) Compile the code with full optimization. Many compilers
    * generate unreasonably bad code before the optimizer tightens
    * things up. If the results are unreasonably good, on the
    * other hand, the optimizer might be too smart for me!
    * 4) Mail the results to mccalpin@perelandra.cms.udel.edu
    * Be sure to include:
    * a) computer hardware model number and software revision
    * b) the compiler flags
    * c) all of the output from the test case.
    *
    * Thanks!
    *
            program Stream
            parameter (N = 4001*1024, NTIMES = 2)
            integer ipad,aoff,boff,coff
            parameter (ipad = 20248 )
            real a(N+ipad),b(N+ipad),c(N+ipad),times(4,NTIMES)
            common /abccom/a,b,c
            real rmstime(4),mintime(4),maxtime(4)
            character*11 label(4)
            real second
            integer realsize,nbpw,bytes(4)
            external second,realsize
            data rmstime/4*0.0/,mintime/4*1.0e+36/,maxtime/4*0.0/
            data label/'Assignment:','Scaling: ','Summing: ',
         $ 'SAXPYing: '/
            data bytes/2,2,3,3/
            data aoff,boff,coff/0,0,0/
            etime()=0.001*timef()

    * --- SETUP --- determine precision and check timing ---

            write(6,9020) ncpu(),N/1024,ipad

            write(6,9030) (label(j),j=1,4)

            nbpw = realsize()

            t = etime()
            do 10 j=1,N
                a(j) = 1.0
                b(j) = 2.0
                c(j) = 0.0
       10 continue
            t = etime()-t

    c print *,'Timing calibration ; time = ',t*100,' hundredths',
    c $ ' of a second'
    c print *,'Increase the size of the arrays if this is <30 ',
    c $ ' and your clock precision is =<1/100 second'
    c print *,'---------------------------------------------------'

            do coff=8,32,8
            do boff=8,32,8
            do aoff=8,32,8
    * --- MAIN LOOP --- repeat test cases NTIMES times ---
            do 1000 k=1,NTIMES

                t = etime()
                do 20 j=1,N
                    c(coff+j) = a(aoff+j)
       20 continue
                t = etime()-t
                times(1,k) = t

                t = etime()
                do 30 j=1,N
                    c(coff+j) = 3.0e0*a(aoff+j)
       30 continue
                t = etime()-t
                times(2,k) = t

                t = etime()
                do 40 j=1,N
                    c(coff+j) = a(aoff+j)+b(boff+j)
       40 continue
                t = etime()-t
                times(3,k) = t

                t = etime()
                do 50 j=1,N
                    c(coff+j) = a(aoff+j)+3.0e0*b(boff+j)
       50 continue
                t = etime()-t
                times(4,k) = t
                call dummysub(a,b,c,n)
     1000 continue

    * --- SUMMARY ---
            do 300 k=1,NTIMES
                do 200 j=1,4
                    rmstime(j) = rmstime(j) + times(j,k)**2
                    mintime(j) = min( mintime(j), times(j,k) )
                    maxtime(j) = max( maxtime(j), times(j,k) )
      200 continue
      300 continue

          write(6,9040) aoff,boff,coff,
         . (N*bytes(j)*nbpw/mintime(j)/1.0e6,j=1,4)

    c write (*,9000)
    c do 320 j=1,4
    c rmstime(j) = sqrt(rmstime(j)/float(NTIMES))
    c write (*,9010) label(j),N*bytes(j)*nbpw/mintime(j)/1.0e6,
    c $ rmstime(j),mintime(j),maxtime(j)
    c 320 continue
            end do
            end do
            end do

     9000 format (' Function',5x,
         $ 'Rate (MB/s) RMS time Min time Max time')
     9010 format (1x,a11,4(2x,f11.4))

     9020 format(/' STREAM benchmark'
         . /' ----------------'
         . /' Number of CPUs: ',i8
         . /' Array size: ',i8,' Kwords'
         . /' Array padding: ',i8,' Words'/)

     9030 format (/1x,' Aoff Boff Coff',2x,4a11
         . /1x,61('-'))
     9040 format (1x,3i5,2x,4f11.3)

            stop
            end

    *-------------------------------------
    * Sample timing routine
    * This code works on Sun and Silicon Graphics machines.
    * real function second(t0)
    * real dummy(2)
    * second = etime(dummy)
    * end
    * Sample timing routine
    * This code works on IBM RS/6000 machines
    cray real function second(t0)
    cray second = mclock()*0.01
    cray end

    *-------------------------------------
    * INTEGER FUNCTION realsize()
    *
    * A semi-portable way to determine the precision of default REAL
    * in Fortran.
    * Here used to guess how many bytes of storage a real number occupies.
    *
            integer function realsize()
            double precision ref(30)
            real test
            double precision pi
            logical silent
            data silent/.true./

    C Test #1 - compare double precision pi to acos(-1.0e0)

            pi = 3.14159 26535 89793 23846 26433 83279 50288 d0
            picalc = acos(-1.0e0)
            diff = abs(picalc-pi)
            if (diff.eq.0.0) then
               if (.not. silent) then
                 print *,'Test #1 Failed = picalc=piexact'
                print *,'Apparently Single=Double Precision'
                print *,'Proceeding to Test #2'
                print *,' '
               end if
                goto 200
            else
                ndigits = -log10(abs(diff))+0.5
                goto 1000
            endif

    C Test #2 - compare single(1.0d0+delta) to 1.0e0

      200 do 10 j=1,30
                ref(j) = 1.0d0+10.0d0**(-j)
       10 continue

            do 20 j=1,30
                test = ref(j)
                ndigits = j
                call dummy(test,result)
                if (test.eq.1.0e0) then
                    goto 1000
                endif
       20 continue
            if (.not. silent) then
            print *,'Test #2 failed - Precision appears to exceed 30 digits'
            print *,'Proceeding to Test #3'
            end if
            goto 300

    C Test #3 - abs(sqrt(1.0d0)-sqrt(1.0e0))

      300 diff = abs(sqrt(1.0d0)-sqrt(1.0e0))
            if (diff.eq.0.0) then
               if (.not. silent) then
                print *,'Test Failed - sqrt(1.0e0)=sqrt(1.0d0)'
                print *,'Apparently Single=Double Precision'
                print *,'Giving up'
               end if
                goto 400
            else
                ndigits = -log10(abs(diff))+0.5
                goto 1000
            endif

     1000 continue
    C Set the result first so the diagnostic below prints a defined value
            if (ndigits.le.8) then
                realsize = 4
            else
                realsize = 8
            endif
            if (.not. silent) then
            write (*,'(a)') '--------------------------------------'
            write (*,'(1x,a,i2,a)') 'Single precision appears to have ',
         $ ndigits,' digits of accuracy'
            write (*,'(1x,a,i1,a)') 'Assuming ',realsize,
         $ ' bytes per default REAL word'
            write (*,'(a)') '--------------------------------------'
            end if
            return

      400 print *,'Hmmmm. I am unable to determine the size of a REAL'
              print *,'Please enter the number of Bytes per REAL number : '
              read (*,*) realsize
            if (realsize.ne.4.and.realsize.ne.8) then
                print *,'Your answer ',realsize,' does not make sense!'
                print *,'Try again!'
                  print *,'Please enter the number of Bytes per ',
         $ 'REAL number : '
                  read (*,*) realsize
            endif
            print *,'You have manually entered a size of ',realsize,
         $ ' bytes per REAL number'
            write (*,'(a)') '--------------------------------------'
            end

            subroutine dummy(q,r)
            r = cos(q)
            return
            end
            subroutine dummysub(a,b,c,n)
            return
            end
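
The rate printed by the `write(6,9040)` statement in the listing above is `N*bytes(j)*nbpw/mintime(j)/1.0e6`, where `bytes` holds the number of array words each kernel touches per iteration (2, 2, 3, 3). The sketch below restates that formula in free-form Fortran and inverts it to show the elapsed time implied by a reported rate. The module and function names are hypothetical, and `nbpw = 8` is an assumption (a 64-bit default REAL on this system); the posted program obtains the value from `realsize()` instead.

```fortran
module stream_rate_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Rate in MB/s (1 MB = 10**6 bytes) for a kernel that moves
  ! words_per_iter array words in each of n loop iterations.
  pure function stream_rate(n, words_per_iter, bytes_per_word, seconds) &
       result(mbs)
    integer, intent(in)  :: n, words_per_iter, bytes_per_word
    real(dp), intent(in) :: seconds
    real(dp) :: mbs
    mbs = real(n, dp) * words_per_iter * bytes_per_word / seconds / 1.0e6_dp
  end function stream_rate

  ! Inverse of stream_rate: elapsed seconds implied by a reported rate.
  pure function kernel_seconds(n, words_per_iter, bytes_per_word, mbs) &
       result(seconds)
    integer, intent(in)  :: n, words_per_iter, bytes_per_word
    real(dp), intent(in) :: mbs
    real(dp) :: seconds
    seconds = real(n, dp) * words_per_iter * bytes_per_word / (mbs * 1.0e6_dp)
  end function kernel_seconds
end module stream_rate_mod

program stream_rate_demo
  use stream_rate_mod
  implicit none
  integer, parameter :: n = 4001*1024   ! N from the posted source
  integer, parameter :: nbpw = 8        ! ASSUMPTION: 8-byte default REAL
  integer  :: words(4), j
  real(dp) :: rates(4)
  character(len=11) :: label(4)

  words = (/ 2, 2, 3, 3 /)              ! same values as data bytes/2,2,3,3/
  ! Rates in MB/s from the 2-CPU output line.
  rates = (/ 4808.209_dp, 4677.492_dp, 4984.247_dp, 4991.767_dp /)
  label = (/ 'Assignment:', 'Scaling:   ', 'Summing:   ', 'SAXPYing:  ' /)

  ! Print the minimum kernel time, in milliseconds, implied by each rate.
  do j = 1, 4
     write (*, '(1x, a11, f10.3, a)') label(j), &
          1.0e3_dp * kernel_seconds(n, words(j), nbpw, rates(j)), ' ms'
  end do
end program stream_rate_demo
```

Under the 8-byte assumption, the Assignment kernel moves 65,552,384 bytes per pass, so the 2-CPU rate of 4808.209 MB/s corresponds to a minimum time of roughly 13.6 ms.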
    ------------------------------------------------------------------------
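
The `realsize()` routine in the listing guesses the size of a default REAL from numerical tests. On a current compiler the same question can be asked directly. The sketch below is an alternative technique, not what the posted program does: it relies on the Fortran 2008 `storage_size` intrinsic, which postdates cf90 3.4, and it assumes 8-bit bytes. The names `real_size_mod` and `real_bytes` are made up for illustration.

```fortran
module real_size_mod
  implicit none
contains
  ! Bytes occupied by a default REAL. storage_size returns the size in
  ! bits, so the division assumes 8-bit bytes.
  pure function real_bytes() result(nbytes)
    integer :: nbytes
    nbytes = storage_size(1.0) / 8
  end function real_bytes
end module real_size_mod

program show_real_size
  use real_size_mod
  implicit none
  real :: x

  print '(a, i0)', 'bytes per default REAL: ', real_bytes()
  ! precision() reports the decimal precision of the argument's kind,
  ! which is the quantity realsize() estimates as ndigits.
  print '(a, i0)', 'decimal digits:         ', precision(x)
end program show_real_size
```

The result depends on the compiler and on any flags that change the default REAL kind, so the output is not shown here.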


