Stream V5 - OS Redhat 7.0/Windows 2000, PGI pgf90 3.2

From: Winfrid Tschiedel (winfrid.tschiedel@hpc.fujitsu-siemens.com)
Date: Thu Mar 01 2001 - 09:31:31 CST


    Hello John,

    Attached you will find some STREAM runs measured on a Pentium 4 system
    (1.5 GHz, 512 MB RDRAM 800) manufactured by Fujitsu Siemens Computers,
    called the Celsius 460.

    Description of the attachments:
    cpuinfo    : contents of /proc/cpuinfo (Linux)
    meminfo    : contents of /proc/meminfo (Linux)
    stream_d.f : the Fortran program used; identical to your program except
                 for the array size and the timing routine

    linux.log  : measurements on Linux (Redhat 7.0), 6 runs including
                 compilation (2 different array sizes)
                 runs 1, 4: high optimization only
                 runs 2, 5: high optimization + prefetch
                 runs 3, 6: high optimization + prefetch + SSE1 instructions
                            for Pentium III enabled

    w2k.log    : measurements on Windows 2000 SP1 (similar to the Linux runs,
                 large array size only)

    I have 2 remarks :

    Windows 2000 seems to be slightly better than Linux, while Windows NT 4
    SP6 was even a little worse than Linux (measurement not included).

    I don't understand

    "Your clock granularity appears to be less than one microsecond
     Your clock granularity/precision appears to be 1 microseconds"

    because I use the Fortran 90 intrinsic system_clock for the time
    measurement, and the clock rate returned by system_clock is 1000, so the
    granularity/precision should be 1 millisecond. Can you help me understand
    the output above?

    Best regards,

    Winfrid

     test_system_clock.f:
             integer ic1,ir1,ic2,ir2,icn,irn
             common /cm_time/ icn,irn
             call system_clock(ic1,ir1)
             do i=1,9999
               call dummy(icn,irn)
               call system_clock(icn,irn)
             enddo
             call system_clock(ic2,ir2)
             write (*,*) " Start ",ic1,ir1
             write (*,*) " End ",ic2,ir2
             write (*,*) " diff ",dble(ic2-ic1)/dble(ir1)," sec. "
             end
             subroutine dummy(iarg1,iarg2)
             integer iarg1,iarg2
             iarg1=iarg2
             end

    pgf90 -fast test_system_clock.f
    [winfrid@p4 ~]$ time ./a.out
      Start 0 1000
      End 14 1000
      diff 1.4000000000000002E-002 sec.
    0.010u 0.000s 0:00.01 100.0% 0+0k 0+0io 131pf+0w
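
One possible cross-check of the 1 millisecond expectation, as a minimal sketch only: the program below compares the integer counts returned by system_clock directly, so no floating-point comparison is involved. The program name tick_check and all variable names are illustrative assumptions and are not part of STREAM or of the attachments. It only reports what the intrinsic delivers on a given system; it does not by itself explain the checktick message.

```fortran
      program tick_check
      implicit none
      integer, parameter :: nsamples = 20
      integer :: ic_prev, ic_now, irate, i, istep, minstep

      ! count_rate is the number of counts per second (1000 = 1 ms).
      call system_clock(ic_prev, irate)
      if (irate .le. 0) stop 'no system clock available'

      minstep = huge(minstep)
      do i = 1, nsamples
         ! Spin until the integer counter itself advances.
         call system_clock(ic_now)
         do while (ic_now .eq. ic_prev)
            call system_clock(ic_now)
         end do
         istep = ic_now - ic_prev
         ! Ignore a negative step caused by counter wrap-around.
         if (istep .gt. 0) minstep = min(minstep, istep)
         ic_prev = ic_now
      end do

      write (*,*) ' count_rate        = ', irate
      write (*,*) ' smallest step     = ', minstep, ' counts'
      write (*,*) ' granularity (sec) = ', dble(minstep)/dble(irate)
      end program tick_check
```

If the smallest step is 1 count at a count_rate of 1000, the timer seen by the program has a granularity of 1 millisecond, which is the value the message above would be expected to reflect.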

    --
    _______________________________________________________________________________
    

    Fujitsu Siemens Computers
    OEA VC HPC          Email: winfrid.tschiedel@hpc.fujitsu-siemens.com
    Otto-Hahn-Ring 6    Tel. : ++49-89-636-45652
    81739 Muenchen      Fax  : ++49-89-636-44088

    MS Exchange : winfrid.tschiedel@fujitsu-siemens.com

    processor       : 0
    vendor_id       : GenuineIntel
    cpu family      : 15
    model           : 0
    model name      : Intel(R) Pentium(R) 4 CPU 1500MHz
    stepping        : 7
    cpu MHz         : 1496.749
    cache size      : 256 KB
    fdiv_bug        : no
    hlt_bug         : no
    f00f_bug        : no
    coma_bug        : no
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 2
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss tm
    bogomips        : 2988.44

               total:       used:       free:  shared:  buffers:    cached:
    Mem:    524955648    89444352   435511296        0   7536640   55529472
    Swap:  1077469184           0  1077469184
    MemTotal:       512652 kB
    MemFree:        425304 kB
    MemShared:           0 kB
    Buffers:          7360 kB
    Cached:          54228 kB
    Active:          41704 kB
    Inact_dirty:     19884 kB
    Inact_clean:         0 kB
    Inact_target:        4 kB
    HighTotal:           0 kB
    HighFree:            0 kB
    LowTotal:       512652 kB
    LowFree:        425304 kB
    SwapTotal:     1052216 kB
    SwapFree:      1052216 kB

    * Program: STREAM
    * Programmer: John D. McCalpin
    * Revision: 5.0, July 30, 2000
    *
    * This program measures memory transfer rates in MB/s for simple
    * computational kernels coded in Fortran.
    * The intent is to demonstrate the extent to which ordinary user
    * code can exploit the main memory bandwidth of the system under
    * test.
    *=========================================================================
    * The STREAM web page is at:
    *          http://www.streambench.org
    *
    * Most of the content is currently hosted at:
    *          http://www.cs.virginia.edu/stream/
    *
    * BRIEF INSTRUCTIONS:
    *       0) See http://www.cs.virginia.edu/stream/ref.html for details
    *       1) STREAM requires a timing function called second().
    *          Several examples are provided in this directory.
    *          "CPU" timers are only allowed for uniprocessor runs.
    *          "Wall-clock" timers are required for all multiprocessor runs.
    *       2) The STREAM array sizes must be set to size the test.
    *          The value "N" must be chosen so that each of the three
    *          arrays is at least 4x larger than the sum of all the
    *          last-level caches used in the run, or 1 million elements,
    *          whichever is larger.
    *          ------------------------------------------------------------
    *          Note that you are free to use any array length and offset
    *          that makes each array 4x larger than the last-level cache.
    *          The intent is to determine the *best* sustainable bandwidth
    *          available with this simple coding.  Of course, lower values
    *          are usually fairly easy to obtain on cached machines, but
    *          by keeping the test to the *best* results, the answers are
    *          easier to interpret.
    *          You may put the arrays in common or not, at your discretion.
    *          There is a commented-out COMMON statement below.
    *          Fortran90 "allocatable" arrays are fine, too.
    *          ------------------------------------------------------------
    *       3) Compile the code with full optimization.  Many compilers
    *          generate unreasonably bad code before the optimizer tightens
    *          things up.  If the results are unreasonably good, on the
    *          other hand, the optimizer might be too smart for me
    *          Please let me know if this happens.
    *       4) Mail the results to mccalpin@cs.virginia.edu
    *          Be sure to include:
    *               a) computer hardware model number and software revision
    *               b) the compiler flags
    *               c) all of the output from the test case.
    *          Please let me know if you do not want your name posted along
    *          with the submitted results.
    *       5) See the web page for more comments about the run rules and
    *          about interpretation of the results.
    *
    * Thanks,
    *   Dr. Bandwidth
    *=========================================================================
    *
          PROGRAM stream
    *     IMPLICIT NONE
    C     .. Parameters ..
          INTEGER n,offset,ndim,ntimes
          PARAMETER (n=12000000,offset=0,ndim=n+offset,ntimes=10)
    C     ..
    C     .. Local Scalars ..
          DOUBLE PRECISION dummy,scalar,t
          INTEGER j,k,nbpw,quantum
    C     ..
    C     .. Local Arrays ..
          DOUBLE PRECISION maxtime(4),mintime(4),avgtime(4),
         $                 times(4,ntimes)
          INTEGER bytes(4)
          CHARACTER label(4)*11
    C     ..
    C     .. External Functions ..
          DOUBLE PRECISION second
          INTEGER checktick,realsize
          EXTERNAL second,checktick,realsize
    C     ..
    C     .. Intrinsic Functions ..
    C
          INTRINSIC dble,max,min,nint,sqrt
    C     ..
    C     .. Arrays in Common ..
          DOUBLE PRECISION a(ndim),b(ndim),c(ndim)
    C     ..
    C     .. Common blocks ..
    *     COMMON a,b,c
    C     ..
    C     .. Data statements ..
          DATA avgtime/4*0.0D0/,mintime/4*1.0D+36/,maxtime/4*0.0D0/
          DATA label/'Copy: ','Scale: ','Add: ',
         $     'Triad: '/
          DATA bytes/2,2,3,3/,dummy/0.0d0/
    C     ..

    *       --- SETUP --- determine precision and check timing ---

          nbpw = realsize()

          WRITE (*,FMT=9010) 'Array size = ',n
          WRITE (*,FMT=9010) 'Offset = ',offset
          WRITE (*,FMT=9020) 'The total memory requirement is ',
         $  3*nbpw*n/ (1024*1024),' MB'
          WRITE (*,FMT=9030) 'You are running each test ',ntimes,' times'
          WRITE (*,FMT=9030) '--'
          WRITE (*,FMT=9030) 'The *best* time for each test is used'
          WRITE (*,FMT=9030) '*EXCLUDING* the first and last iterations'

    !$OMP PARALLEL DO
          DO 10 j = 1,n
              a(j) = 2.0d0
              b(j) = 0.5D0
              c(j) = 0.0D0
       10 CONTINUE
          t = second(dummy)
    !$OMP PARALLEL DO
          DO 20 j = 1,n
              a(j) = 0.5d0*a(j)
       20 CONTINUE
          t = second(dummy) - t
          PRINT *,'----------------------------------------------------'
          quantum = checktick()
          WRITE (*,FMT=9000)
         $  'Your clock granularity/precision appears to be ',quantum,
         $  ' microseconds'
          PRINT *,'----------------------------------------------------'

    *       --- MAIN LOOP --- repeat test cases NTIMES times ---
          scalar = 0.5d0*a(1)
          DO 70 k = 1,ntimes

              t = second(dummy)
              a(1) = a(1) + t
    !$OMP PARALLEL DO
              DO 30 j = 1,n
                  c(j) = a(j)
       30     CONTINUE
              t = second(dummy) - t
              c(n) = c(n) + t
              times(1,k) = t

              t = second(dummy)
              c(1) = c(1) + t
    !$OMP PARALLEL DO
              DO 40 j = 1,n
                  b(j) = scalar*c(j)
       40     CONTINUE
              t = second(dummy) - t
              b(n) = b(n) + t
              times(2,k) = t

              t = second(dummy)
              a(1) = a(1) + t
    !$OMP PARALLEL DO
              DO 50 j = 1,n
                  c(j) = a(j) + b(j)
       50     CONTINUE
              t = second(dummy) - t
              c(n) = c(n) + t
              times(3,k) = t

              t = second(dummy)
              b(1) = b(1) + t
    !$OMP PARALLEL DO
              DO 60 j = 1,n
                  a(j) = b(j) + scalar*c(j)
       60     CONTINUE
              t = second(dummy) - t
              a(n) = a(n) + t
              times(4,k) = t
       70 CONTINUE

    *       --- SUMMARY ---
          DO 90 k = 2,ntimes-1
              DO 80 j = 1,4
                  avgtime(j) = avgtime(j) + times(j,k)
                  mintime(j) = min(mintime(j),times(j,k))
                  maxtime(j) = max(maxtime(j),times(j,k))
       80     CONTINUE
       90 CONTINUE
          WRITE (*,FMT=9040)
          DO 100 j = 1,4
              avgtime(j) = avgtime(j)/dble(ntimes-2)
              WRITE (*,FMT=9050) label(j),n*bytes(j)*nbpw/mintime(j)/1.0D6,
         $      avgtime(j),mintime(j),maxtime(j)
      100 CONTINUE
          PRINT *,'----------------------------------------------------'
          CALL checksums (a,b,c,n,ntimes)
          PRINT *,'----------------------------------------------------'

     9000 FORMAT (1x,a,i6,a)
     9010 FORMAT (1x,a,i10)
     9020 FORMAT (1x,a,i4,a)
     9030 FORMAT (1x,a,i3,a,a)
     9040 FORMAT ('Function',5x,'Rate (MB/s) Avg time Min time Max time'
         $       )
     9050 FORMAT (a,4 (f10.4,2x))
          END

    *-------------------------------------
    * INTEGER FUNCTION dblesize()
    *
    * A semi-portable way to determine the precision of DOUBLE PRECISION
    * in Fortran.
    * Here used to guess how many bytes of storage a DOUBLE PRECISION
    * number occupies.
    *
          INTEGER FUNCTION realsize()
    *     IMPLICIT NONE

    C     .. Local Scalars ..
          DOUBLE PRECISION result,test
          INTEGER j,ndigits
    C     ..
    C     .. Local Arrays ..
          DOUBLE PRECISION ref(30)
    C     ..
    C     .. External Subroutines ..
          EXTERNAL confuse
    C     ..
    C     .. Intrinsic Functions ..
          INTRINSIC abs,acos,log10,sqrt
    C     ..

    C       Test #1 - compare single(1.0d0+delta) to 1.0d0

       10 DO 20 j = 1,30
              ref(j) = 1.0d0 + 10.0d0** (-j)
       20 CONTINUE

          DO 30 j = 1,30
              test = ref(j)
              ndigits = j
              CALL confuse(test,result)
              IF (test.EQ.1.0D0) THEN
                  GO TO 40
              END IF
       30 CONTINUE
          GO TO 50

       40 WRITE (*,FMT='(a)')
         $  '----------------------------------------------'
          WRITE (*,FMT='(1x,a,i2,a)') 'Double precision appears to have ',
         $  ndigits,' digits of accuracy'
          IF (ndigits.LE.8) THEN
              realsize = 4
          ELSE
              realsize = 8
          END IF
          WRITE (*,FMT='(1x,a,i1,a)') 'Assuming ',realsize,
         $  ' bytes per DOUBLE PRECISION word'
          WRITE (*,FMT='(a)')
         $  '----------------------------------------------'
          RETURN

       50 PRINT *,'Hmmmm.  I am unable to determine the size.'
          PRINT *,'Please enter the number of Bytes per DOUBLE PRECISION',
         $  ' number : '
          READ (*,FMT=*) realsize
          IF (realsize.NE.4 .AND. realsize.NE.8) THEN
              PRINT *,'Your answer ',realsize,' does not make sense.'
              PRINT *,'Try again.'
              PRINT *,'Please enter the number of Bytes per ',
         $      'DOUBLE PRECISION number : '
              READ (*,FMT=*) realsize
          END IF
          PRINT *,'You have manually entered a size of ',realsize,
         $  ' bytes per DOUBLE PRECISION number'
          WRITE (*,FMT='(a)')
         $  '----------------------------------------------'
          END

          SUBROUTINE confuse(q,r)
    *     IMPLICIT NONE
    C     .. Scalar Arguments ..
          DOUBLE PRECISION q,r
    C     ..
    C     .. Intrinsic Functions ..
          INTRINSIC cos
    C     ..
          r = cos(q)
          RETURN
          END

    * A semi-portable way to determine the clock granularity
    * Adapted from a code by John Henning of Digital Equipment Corporation
    *
          INTEGER FUNCTION checktick()
    *     IMPLICIT NONE

    C     .. Parameters ..
          INTEGER n
          PARAMETER (n=20)
    C     ..
    C     .. Local Scalars ..
          DOUBLE PRECISION dummy,t1,t2
          INTEGER i,j,jmin
    C     ..
    C     .. Local Arrays ..
          DOUBLE PRECISION timesfound(n)
    C     ..
    C     .. External Functions ..
          DOUBLE PRECISION second
          EXTERNAL second
    C     ..
    C     .. Intrinsic Functions ..
          INTRINSIC max,min,nint
    C     ..
          i = 0
          dummy = 0.0d0
          t1 = second(dummy)

       10 t2 = second(dummy)
          IF (t2.EQ.t1) GO TO 10

          t1 = t2
          i = i + 1
          timesfound(i) = t1
          IF (i.LT.n) GO TO 10

          jmin = 1000000
          DO 20 i = 2,n
              j = nint((timesfound(i)-timesfound(i-1))*1d6)
              jmin = min(jmin,max(j,0))
       20 CONTINUE

          IF (jmin.GT.0) THEN
              checktick = jmin
          ELSE
              PRINT *,'Your clock granularity appears to be less ',
         $      'than one microsecond'
              checktick = 1
          END IF
          RETURN

    *      PRINT 14, timesfound(1)*1d6
    *      DO 20 i=2,n
    *         PRINT 14, timesfound(i)*1d6,
    *     &       nint((timesfound(i)-timesfound(i-1))*1d6)
    *   14    FORMAT (1X, F18.4, 1X, i8)
    *   20 CONTINUE

          END

          SUBROUTINE checksums(a,b,c,n,ntimes)
    *     IMPLICIT NONE
    C     ..
    C     .. Arguments ..
          DOUBLE PRECISION a(*),b(*),c(*)
          INTEGER n,ntimes
    C     ..
    C     .. Local Scalars ..
          DOUBLE PRECISION aa,bb,cc,scalar,suma,sumb,sumc,epsilon
          INTEGER k
    C     ..

    C     Repeat the main loop, but with scalars only.
    C     This is done to check the sum & make sure all
    C     iterations have been executed correctly.

          aa = 2.0D0
          bb = 0.5D0
          cc = 0.0D0
          aa = 0.5D0*aa
          scalar = 0.5d0*aa
          DO k = 1,ntimes
              cc = aa
              bb = scalar*cc
              cc = aa + bb
              aa = bb + scalar*cc
          END DO
          aa = aa*DBLE(n-2)
          bb = bb*DBLE(n-2)
          cc = cc*DBLE(n-2)

    C     Now sum up the arrays, excluding the first and last
    C     elements, which are modified using the timing results
    C     to confuse aggressive optimizers.

          suma = 0.0d0
          sumb = 0.0d0
          sumc = 0.0d0
    !$OMP PARALLEL DO REDUCTION(+:suma,sumb,sumc)
          DO 110 j = 2,n-1
              suma = suma + a(j)
              sumb = sumb + b(j)
              sumc = sumc + c(j)
      110 CONTINUE

          epsilon = 1.D-6

          IF (ABS(suma-aa)/suma .GT. epsilon) THEN
              PRINT *,'Failed Validation on array a()'
              PRINT *,'Target Sum of a is = ',aa
              PRINT *,'Computed Sum of a is = ',suma
          ELSEIF (ABS(sumb-bb)/sumb .GT. epsilon) THEN
              PRINT *,'Failed Validation on array b()'
              PRINT *,'Target Sum of b is = ',bb
              PRINT *,'Computed Sum of b is = ',sumb
          ELSEIF (ABS(sumc-cc)/sumc .GT. epsilon) THEN
              PRINT *,'Failed Validation on array c()'
              PRINT *,'Target Sum of c is = ',cc
              PRINT *,'Computed Sum of c is = ',sumc
          ELSE
              PRINT *,'Solution Validates!'
          ENDIF

          END

          double precision function second(dummy)
          double precision dummy
          call system_clock(icount,irate)
          second=dble(icount)/dble(irate)
          end

    Script started on Thu Mar 1 15:29:46 2001
    [winfrid@p4 ~]$ uname -a
    Linux p4 2.4.1-ac10 #3 SMP Mon Feb 12 16:04:12 CET 2001 i686 unknown
    [winfrid@p4 ~]$ pgf90 -Minfo -fast stream_d.f -o stream_d
    stream:
       111, Loop unrolled 10 times
       118, Loop unrolled 8 times
       136, Loop unrolled 10 times
       146, Loop unrolled 8 times
       156, Loop unrolled 5 times
       166, Loop unrolled 4 times
       176, Loop unrolled 4 times (completely unrolled)
    checksums:
       370, Loop unrolled 3 times
       388, Loop unrolled 2 times
    /usr/pgi/linux86/lib/libpgf902.a(cnfg.o): In function `__hpfio_scratch_name':
    cnfg.o(.text+0x2a): the use of `tempnam' is dangerous, better use `mkstemp'
    [winfrid@p4 ~]$ time ./stream_d
    ----------------------------------------------
    Double precision appears to have 16 digits of accuracy
    Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
    Array size = 2000000
    Offset = 0
    The total memory requirement is 45 MB
    You are running each test 10 times
    --
    The *best* time for each test is used
    *EXCLUDING* the first and last iterations
    ----------------------------------------------------
    Your clock granularity appears to be less than one microsecond
    Your clock granularity/precision appears to be 1 microseconds
    ----------------------------------------------------
    Function   Rate (MB/s)   Avg time   Min time   Max time
    Copy:        1230.7692     0.0260     0.0260     0.0260
    Scale:       1280.0000     0.0259     0.0250     0.0260
    Add:         1500.0000     0.0326     0.0320     0.0330
    Triad:       1500.0000     0.0325     0.0320     0.0330
    ----------------------------------------------------
    Solution Validates!
    ----------------------------------------------------
    1.250u 0.090s 0:01.34 100.0% 0+0k 0+0io 141pf+0w
    [winfrid@p4 ~]$ pgf90 -Minfo -fast -Mvect=prefetch stream_d.f -o stream_d-prefetch
    stream:
       111, Unrolling inner loop 4 times
       118, Unrolling inner loop 4 times
            Used streaming stores for 1 stores
            Generated prefetch instructions for 1 loads
       136, Unrolling inner loop 4 times
            Used streaming stores for 1 stores
            Generated prefetch instructions for 1 loads
       146, Unrolling inner loop 4 times
            Used streaming stores for 1 stores
            Generated prefetch instructions for 1 loads
       156, Unrolling inner loop 4 times
            Used streaming stores for 1 stores
            Generated prefetch instructions for 2 loads
       166, Unrolling inner loop 4 times
            Used streaming stores for 1 stores
            Generated prefetch instructions for 2 loads
       175, Unrolling inner loop 4 times
       176, Loop unrolled 4 times (completely unrolled)
    checksums:
       370, Unrolling inner loop 4 times
       388, Unrolling inner loop 4 times
            Generated prefetch instructions for 3 loads
    /usr/pgi/linux86/lib/libpgf902.a(cnfg.o): In function `__hpfio_scratch_name':
    cnfg.o(.text+0x2a): the use of `tempnam' is dangerous, better use `mkstemp'
    [winfrid@p4 ~]$ time ./stream_d-prefetch
    ----------------------------------------------
    Double precision appears to have 16 digits of accuracy
    Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
    Array size = 2000000
    Offset = 0
    The total memory requirement is 45 MB
    You are running each test 10 times
    --
    The *best* time for each test is used
    *EXCLUDING* the first and last iterations
    ----------------------------------------------------
    Your clock granularity appears to be less than one microsecond
    Your clock granularity/precision appears to be 1 microseconds
    ----------------------------------------------------
    Function   Rate (MB/s)   Avg time   Min time   Max time
    Copy:        2133.3333     0.0154     0.0150     0.0160
    Scale:       2000.0000     0.0166     0.0160     0.0170
    Add:         2086.9565     0.0231     0.0230     0.0240
    Triad:       2086.9565     0.0233     0.0230     0.0240
    ----------------------------------------------------
    Solution Validates!
    ----------------------------------------------------
    0.850u 0.100s 0:00.95 100.0% 0+0k 0+0io 141pf+0w
    [winfrid@p4 ~]$ pgf90 -Minfo -fast -Mvect=sse stream_d.f -o stream_d-sse
    stream:
       111, Generating sse code for inner loop
       118, Generating sse code for inner loop
            Generated prefetch instructions for 1 loads
       136, Generating sse code for inner loop
            Generated prefetch instructions for 1 loads
       146, Generating sse code for inner loop
            Generated prefetch instructions for 1 loads
       156, Generating sse code for inner loop
            Generated prefetch instructions for 2 loads
       166, Generating sse code for inner loop
            Generated prefetch instructions for 2 loads
       175, Unrolling inner loop 4 times
       176, Loop unrolled 4 times (completely unrolled)
    checksums:
       370, Unrolling inner loop 4 times
       388, Unrolling inner loop 4 times
            Generated prefetch instructions for 3 loads
    /usr/pgi/linux86/lib/libpgf902.a(cnfg.o): In function `__hpfio_scratch_name':
    cnfg.o(.text+0x2a): the use of `tempnam' is dangerous, better use `mkstemp'
    [winfrid@p4 ~]$ time ./stream_d-sse
    ----------------------------------------------
    Double precision appears to have 16 digits of accuracy
    Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
    Array size = 2000000
    Offset = 0
    The total memory requirement is 45 MB
    You are running each test 10 times
    --
    The *best* time for each test is used
    *EXCLUDING* the first and last iterations
    ----------------------------------------------------
    Your clock granularity appears to be less than one microsecond
    Your clock granularity/precision appears to be 1 microseconds
    ----------------------------------------------------
    Function   Rate (MB/s)   Avg time   Min time   Max time
    Copy:        2133.3333     0.0155     0.0150     0.0160
    Scale:       2133.3333     0.0156     0.0150     0.0160
    Add:         2181.8182     0.0226     0.0220     0.0230
    Triad:       2181.8182     0.0225     0.0220     0.0230
    ----------------------------------------------------
    Solution Validates!
    ----------------------------------------------------
    0.830u 0.100s 0:00.93 100.0% 0+0k 0+0io 142pf+0w
    [winfrid@p4 ~]$ vi stream_d.f
    [winfrid@p4 ~]$ pgf90 -Minfo -fast stream_d.f -o stream_d
    stream:
       111, Loop unrolled 10 times
       118, Loop unrolled 8 times
       136, Loop unrolled 10 times
       146, Loop unrolled 8 times
       156, Loop unrolled 5 times
       166, Loop unrolled 4 times
       176, Loop unrolled 4 times (completely unrolled)
    checksums:
       370, Loop unrolled 3 times
       388, Loop unrolled 2 times
    /usr/pgi/linux86/lib/libpgf902.a(cnfg.o): In function `__hpfio_scratch_name':
    cnfg.o(.text+0x2a): the use of `tempnam' is dangerous, better use `mkstemp'
    [winfrid@p4 ~]$ pgf90 -Minfo -fast -Mvect=prefetch stream_d.f -o stream_d-prefetch
    stream:
       111, Unrolling inner loop 4 times
       118, Unrolling inner loop 4 times
            Used streaming stores for 1 stores
            Generated prefetch instructions for 1 loads
       136, Unrolling inner loop 4 times
            Used streaming stores for 1 stores
            Generated prefetch instructions for 1 loads
       146, Unrolling inner loop 4 times
            Used streaming stores for 1 stores
            Generated prefetch instructions for 1 loads
       156, Unrolling inner loop 4 times
            Used streaming stores for 1 stores
            Generated prefetch instructions for 2 loads
       166, Unrolling inner loop 4 times
            Used streaming stores for 1 stores
            Generated prefetch instructions for 2 loads
       175, Unrolling inner loop 4 times
       176, Loop unrolled 4 times (completely unrolled)
    checksums:
       370, Unrolling inner loop 4 times
       388, Unrolling inner loop 4 times
            Generated prefetch instructions for 3 loads
    /usr/pgi/linux86/lib/libpgf902.a(cnfg.o): In function `__hpfio_scratch_name':
    cnfg.o(.text+0x2a): the use of `tempnam' is dangerous, better use `mkstemp'
    [winfrid@p4 ~]$ pgf90 -Minfo -fast -Mvect=sse stream_d.f -o stream_d-sse
    stream:
       111, Generating sse code for inner loop
       118, Generating sse code for inner loop
            Generated prefetch instructions for 1 loads
       136, Generating sse code for inner loop
            Generated prefetch instructions for 1 loads
       146, Generating sse code for inner loop
            Generated prefetch instructions for 1 loads
       156, Generating sse code for inner loop
            Generated prefetch instructions for 2 loads
       166, Generating sse code for inner loop
            Generated prefetch instructions for 2 loads
       175, Unrolling inner loop 4 times
       176, Loop unrolled 4 times (completely unrolled)
    checksums:
       370, Unrolling inner loop 4 times
       388, Unrolling inner loop 4 times
            Generated prefetch instructions for 3 loads
    /usr/pgi/linux86/lib/libpgf902.a(cnfg.o): In function `__hpfio_scratch_name':
    cnfg.o(.text+0x2a): the use of `tempnam' is dangerous, better use `mkstemp'
    [winfrid@p4 ~]$ time ./stream_d
    ----------------------------------------------
    Double precision appears to have 16 digits of accuracy
    Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
    Array size = 12000000
    Offset = 0
    The total memory requirement is 274 MB
    You are running each test 10 times
    --
    The *best* time for each test is used
    *EXCLUDING* the first and last iterations
    ----------------------------------------------------
    Your clock granularity appears to be less than one microsecond
    Your clock granularity/precision appears to be 1 microseconds
    ----------------------------------------------------
    Function   Rate (MB/s)   Avg time   Min time   Max time
    Copy:        1222.9299     0.1574     0.1570     0.1580
    Scale:       1238.7097     0.1554     0.1550     0.1560
    Add:         1476.9231     0.1957     0.1950     0.1960
    Triad:       1469.3878     0.1960     0.1960     0.1960
    ----------------------------------------------------
    Solution Validates!
    ----------------------------------------------------
    7.420u 0.630s 0:08.04 100.1% 0+0k 0+0io 141pf+0w
    [winfrid@p4 ~]$ time ./stream_d-prefetch
    ----------------------------------------------
    Double precision appears to have 16 digits of accuracy
    Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
    Array size = 12000000
    Offset = 0
    The total memory requirement is 274 MB
    You are running each test 10 times
    --
    The *best* time for each test is used
    *EXCLUDING* the first and last iterations
    ----------------------------------------------------
    Your clock granularity appears to be less than one microsecond
    Your clock granularity/precision appears to be 1 microseconds
    ----------------------------------------------------
    Function   Rate (MB/s)   Avg time   Min time   Max time
    Copy:        2042.5532     0.0940     0.0940     0.0940
    Scale:       1939.3939     0.0994     0.0990     0.1000
    Add:         2086.9565     0.1386     0.1380     0.1390
    Triad:       2086.9565     0.1384     0.1380     0.1390
    ----------------------------------------------------
    Solution Validates!
    ----------------------------------------------------
    5.120u 0.590s 0:05.70 100.1% 0+0k 0+0io 141pf+0w
    [winfrid@p4 ~]$ time ./stream_d-sse
    ----------------------------------------------
    Double precision appears to have 16 digits of accuracy
    Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
    Array size = 12000000
    Offset = 0
    The total memory requirement is 274 MB
    You are running each test 10 times
    --
    The *best* time for each test is used
    *EXCLUDING* the first and last iterations
    ----------------------------------------------------
    Your clock granularity appears to be less than one microsecond
    Your clock granularity/precision appears to be 1 microseconds
    ----------------------------------------------------
    Function   Rate (MB/s)   Avg time   Min time   Max time
    Copy:        2064.5161     0.0937     0.0930     0.0940
    Scale:       2086.9565     0.0929     0.0920     0.0930
    Add:         2149.2537     0.1348     0.1340     0.1350
    Triad:       2133.3333     0.1354     0.1350     0.1360
    ----------------------------------------------------
    Solution Validates!
    ----------------------------------------------------
    5.000u 0.590s 0:05.58 100.1% 0+0k 0+0io 142pf+0w
    [winfrid@p4 ~]$ exit

    Script done on Thu Mar 1 15:34:26 2001
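
The rates in the log follow from the formula in the STREAM summary loop, n*bytes(j)*nbpw/mintime(j)/1.0D6. As a minimal sketch of that arithmetic for the Copy kernel of the first Linux run (n = 2000000, 8-byte words, 2 words moved per iteration), the program below evaluates the rate for a minimum time of 26 ms and again for a time one tick shorter, assuming the 1 ms tick discussed in the message. The program name rate_check and the variable names are illustrative only.

```fortran
      program rate_check
      implicit none
      ! Values from the first Linux run; nwords corresponds to
      ! bytes(1) = 2 in stream_d.f (one load and one store).
      integer, parameter :: n = 2000000, nbpw = 8, nwords = 2
      double precision :: nbytes, tick, tmin, r0, r1

      nbytes = dble(n)*dble(nwords)*dble(nbpw)
      tick = 1.0d-3
      tmin = 0.026d0

      ! Same formula as the STREAM summary: bytes / time / 1.0D6.
      r0 = nbytes/tmin/1.0d6
      r1 = nbytes/(tmin - tick)/1.0d6

      write (*,'(1x,a,f10.4)') 'rate at 26 ms (MB/s): ', r0
      write (*,'(1x,a,f10.4)') 'rate at 25 ms (MB/s): ', r1
      write (*,'(1x,a,f10.4)') 'change per tick     : ', r1 - r0
      end program rate_check
```

The two rates correspond to the 1230.7692 and 1280.0000 MB/s entries reported for minimum times of 0.0260 and 0.0250 seconds, so at the small array size a single 1 ms tick shifts the reported rate by roughly 4%. The larger array size (n = 12000000) reduces this quantization effect because the timed intervals are about six times longer.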

    PGI$ pgf90 -fast -Minfo -Mvect=sse stream_d.f -o stream_d_w2k-sse.exe

    stream:
       111, Generating sse code for inner loop
       118, Generating sse code for inner loop
            Generated prefetch instructions for 1 loads
       136, Generating sse code for inner loop
            Generated prefetch instructions for 1 loads
       146, Generating sse code for inner loop
            Generated prefetch instructions for 1 loads
       156, Generating sse code for inner loop
            Generated prefetch instructions for 2 loads
       166, Generating sse code for inner loop
            Generated prefetch instructions for 2 loads
       175, Unrolling inner loop 4 times
       176, Loop unrolled 4 times (completely unrolled)
    checksums:
       370, Unrolling inner loop 4 times
       388, Unrolling inner loop 4 times
            Generated prefetch instructions for 3 loads
    PGI$ time stream_d_w2k-sse.exe
    ----------------------------------------------
    Double precision appears to have 16 digits of accuracy
    Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
    Array size = 12000000
    Offset = 0
    The total memory requirement is 274 MB
    You are running each test 10 times
    --
    The *best* time for each test is used
    *EXCLUDING* the first and last iterations
    ----------------------------------------------------
    Your clock granularity appears to be less than one microsecond
    Your clock granularity/precision appears to be 1 microseconds
    ----------------------------------------------------
    Function   Rate (MB/s)   Avg time   Min time   Max time
    Copy:        2370.3704     0.0889     0.0810     0.0900
    Scale:       2400.0000     0.0876     0.0800     0.0910
    Add:         2232.5581     0.1325     0.1290     0.1410
    Triad:       2215.3846     0.1329     0.1300     0.1400
    ----------------------------------------------------
    Solution Validates!
    ----------------------------------------------------

    real    0m5.428s
    user    0m0.010s
    sys     0m0.010s

    PGI$ pgf90 -fast -Minfo stream_d.f -o stream_d_w2k.exe
    stream:
       111, Loop unrolled 10 times
       118, Loop unrolled 8 times
       136, Loop unrolled 10 times
       146, Loop unrolled 8 times
       156, Loop unrolled 5 times
       166, Loop unrolled 4 times
       176, Loop unrolled 4 times (completely unrolled)
    checksums:
       370, Loop unrolled 3 times
       388, Loop unrolled 2 times
    PGI$ time stream_d_w2k.exe
    ----------------------------------------------
    Double precision appears to have 16 digits of accuracy
    Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
    Array size = 12000000
    Offset = 0
    The total memory requirement is 274 MB
    You are running each test 10 times
    --
    The *best* time for each test is used
    *EXCLUDING* the first and last iterations
    ----------------------------------------------------
    Your clock granularity appears to be less than one microsecond
    Your clock granularity/precision appears to be 1 microseconds
    ----------------------------------------------------
    Function   Rate (MB/s)   Avg time   Min time   Max time
    Copy:        1476.9231     0.1391     0.1300     0.1410
    Scale:       1476.9231     0.1340     0.1300     0.1410
    Add:         1523.8095     0.1911     0.1890     0.2000
    Triad:       1515.7895     0.1916     0.1900     0.2010
    ----------------------------------------------------
    Solution Validates!
    ----------------------------------------------------

    real    0m7.490s
    user    0m0.010s
    sys     0m0.000s

    pgf90 -fast -Minfo -Mvect=prefetch stream_d.f -o stream_d_w2k-prefetch.exe

    stream:
       111, Unrolling inner loop 4 times
       118, Unrolling inner loop 4 times
            Used streaming stores for 1 stores
            Generated prefetch instructions for 1 loads
       136, Unrolling inner loop 4 times
            Used streaming stores for 1 stores
            Generated prefetch instructions for 1 loads
       146, Unrolling inner loop 4 times
            Used streaming stores for 1 stores
            Generated prefetch instructions for 1 loads
       156, Unrolling inner loop 4 times
            Used streaming stores for 1 stores
            Generated prefetch instructions for 2 loads
       166, Unrolling inner loop 4 times
            Used streaming stores for 1 stores
            Generated prefetch instructions for 2 loads
       175, Unrolling inner loop 4 times
       176, Loop unrolled 4 times (completely unrolled)
    checksums:
       370, Unrolling inner loop 4 times
       388, Unrolling inner loop 4 times
            Generated prefetch instructions for 3 loads
    PGI$ time stream_d_w2k-prefetch.exe
    ----------------------------------------------
    Double precision appears to have 16 digits of accuracy
    Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
    Array size = 12000000
    Offset = 0
    The total memory requirement is 274 MB
    You are running each test 10 times
    --
    The *best* time for each test is used
    *EXCLUDING* the first and last iterations
    ----------------------------------------------------
    Your clock granularity appears to be less than one microsecond
    Your clock granularity/precision appears to be 1 microseconds
    ----------------------------------------------------
    Function   Rate (MB/s)   Avg time   Min time   Max time
    Copy:        2400.0000     0.0878     0.0800     0.0910
    Scale:       2157.3034     0.0986     0.0890     0.1000
    Add:         2215.3846     0.1391     0.1300     0.1410
    Triad:       2215.3846     0.1364     0.1300     0.1410
    ----------------------------------------------------
    Solution Validates!
    ----------------------------------------------------

    real    0m5.598s
    user    0m0.010s
    sys     0m0.000s
    PGI$ time stream_d_w2k-prefetch.exe
    ----------------------------------------------
    Double precision appears to have 16 digits of accuracy
    Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
    Array size = 12000000
    Offset = 0
    The total memory requirement is 274 MB
    You are running each test 10 times
    --
    The *best* time for each test is used
    *EXCLUDING* the first and last iterations
    ----------------------------------------------------
    Your clock granularity appears to be less than one microsecond
    Your clock granularity/precision appears to be 1 microseconds
    ----------------------------------------------------
    Function   Rate (MB/s)   Avg time   Min time   Max time
    Copy:        2400.0000     0.0874     0.0800     0.0900
    Scale:       2133.3333     0.0967     0.0900     0.1010
    Add:         2057.1429     0.1401     0.1400     0.1410
    Triad:       2215.3846     0.1378     0.1300     0.1410
    ----------------------------------------------------
    Solution Validates!
    ----------------------------------------------------

    real    0m5.568s
    user    0m0.010s
    sys     0m0.000s



    This archive was generated by hypermail 2b29 : Mon Apr 23 2001 - 09:29:54 CDT