[Fwd: DS10 fixup]

From: John McCalpin (mccalpin@austin.ibm.com)
Date: Wed Jun 28 2000 - 10:55:36 CDT

Next message: Wei Lin: "Stream results Super7 platform: K6-2, MII, mP6, PentiumMMX"

Previous message: ISOBE, Michiro: "STREAM results of tuned Power Mac G4."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

-- 
John D. McCalpin, Ph.D.           mccalpin@austin.ibm.com
Senior Scientist           IBM POWER Microprocessor Development
    "I am willing to make mistakes as long as
     someone else is willing to learn from them."

attached mail follows:

Hi John,

Two problems: (1) Your web site currently has two different values for the DS10 and DS10L. This should not be true. (2) Furthermore, the values I've measured with the latest rev DS10 do not match EITHER of the two values you have posted.

These confusing results stimulated me to do a massive number of experiments - on various different combinations of system, firmware, software, memory config, etc.

I believe the attached is a more accurate representation of the currently shipping product than either of the existing postings. If you are willing, I'd suggest calling it:

Compaq_AlphaServer_DS10/DS10L

or if space does not permit the above, then how about

Compaq_Alpha_DS10/DS10L

Please *remove* the result that I submitted 5/26/99, http://www.cs.virginia.edu/stream/stream_mail/1999/0026.html . Or if you don't like removing old history, you could move it to the experimental table or somehow rename it with something like "older rev" or "obsolete firmware".

The other existing result is from Greg Lindahl. Greg wrote in http://www.cs.virginia.edu/stream/stream_mail/2000/0008.html that he thought that his slower result was due to hardware. Actually, it's probably due to the fact that he was using Linux, and the compiler version he had may have been slightly less aggressive in its use of prefetching. It's not hardware; I have it from an authoritative member of the engineering team that the DS10 and DS10L memory systems are the same, in all aspects except capacity. The former can hold 2GB; the latter can hold only 1GB. So perhaps you could mark Greg's result as something like

Compaq_Alpha_DS10_Linux

You'll note that the source below is identical to the source for the recent ES40 submissions; I just compiled it WITHOUT "-omp", so those parallel directives are treated as comments.
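
To illustrate why leaving off "-omp" is enough, here is a minimal Python sketch of how a fixed-form Fortran front end might classify source lines. The helper `classify` and its `openmp` switch are hypothetical names made up for this illustration; they are not part of the Compaq compiler. The sketch relies only on two facts: a fixed-form line with `C`, `c`, `*`, or `!` in column 1 is a comment, and `c$omp`, `*$omp`, and `!$omp` are the fixed-form OpenMP sentinels.

```python
# Hypothetical helper for illustration only; not part of any real compiler.
OMP_SENTINELS = ("c$omp", "*$omp", "!$omp")  # fixed-form OpenMP sentinels


def classify(line: str, openmp: bool) -> str:
    """Return 'directive', 'comment', or 'code' for one fixed-form line."""
    if not line.strip():
        return "comment"  # blank lines carry no statement
    if line[:5].lower() in OMP_SENTINELS:
        # Only an OpenMP-enabled compile gives the sentinel any meaning.
        return "directive" if openmp else "comment"
    if line[0] in "cC*!":
        return "comment"  # ordinary column-1 comment marker
    return "code"


# The Copy kernel from the source below, plus one ordinary comment line.
sample = [
    "c$omp parallel do",
    "          DO 30 j = 1,n",
    "              c(j) = a(j)",
    "   30     CONTINUE",
    "* this is an ordinary comment",
]

if __name__ == "__main__":
    print("no -omp   | with -omp | source line")
    for text in sample:
        without = classify(text, openmp=False)
        enabled = classify(text, openmp=True)
        print(f"{without:9s} | {enabled:9s} | {text}")
```

Under these assumptions the `c$omp parallel do` line is the only one whose classification changes between the two columns, which is why the same source file can serve both the serial DS10 run and the parallel ES40 runs.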

Thanks - John Henning

Script started on Tue Jun 27 05:00:51 2000
% /usr/sbin/psrinfo -v
Status of processor 0 as of: 06/27/00 05:00:55
  Processor has been on-line since 06/21/2000 11:15:15
  The alpha EV6 (21264) processor operates at 463 MHz,
  and has an alpha internal floating point processor.
% diff stream_d.f_as_at_ftp_site_22may97 mcc_omp.f
0a1,4
> * this version 25-May-2000 j. henning compaq
> * Arrays are ALLOCATABLE; OpenMP syntax has been simplified (per
> * suggestion from John McCalpin); timer is Fortran-90 SYSTEM_CLOCK
>
52,53c56,57
<       INTEGER n,offset,ndim,ntimes
<       PARAMETER (n=2000000,offset=0,ndim=n+offset,ntimes=10)
---
>       INTEGER*8 n,offset,ndim,ntimes,maxtimes
>       PARAMETER (maxtimes=10000)
57c61
<       INTEGER j,k,nbpw,quantum
---
>       INTEGER*8 j,k,nbpw,quantum
60,62c64,66
<       DOUBLE PRECISION maxtime(4),mintime(4),rmstime(4),sum(3),
<      $                 times(4,ntimes)
<       INTEGER bytes(4)
---
>       DOUBLE PRECISION maxtime(4),mintime(4),rmstime(4),
>      $                 sum1,sum2,sum3,times(4,maxtimes),avgbw
>       INTEGER*8 bytes(4)
75c79
<       DOUBLE PRECISION a(ndim),b(ndim),c(ndim)
---
>       REAL*8, ALLOCATABLE:: a(:),b(:),c(:)
88a93,98
>       PRINT *, "n, offset, ntimes"
>       READ *, n, offset, ntimes
>       ndim=n+offset
>       IF (ntimes .GT. maxtimes) ntimes=maxtimes
>       ALLOCATE (a(ndim), b(ndim), c(ndim))
>       CALL defend_wrap
94c104
<      $  3*nbpw*n/ (1024*1024),' MB'
---
>      $  3D0*nbpw*DBLE(n)/ (1024d0*1024),' MB'
97a108
> c$omp parallel do
103a115
> c$omp parallel do
128a141
> c$omp parallel do
135a149
> c$omp parallel do
142a157
> c$omp parallel do
149a165
> c$omp parallel do
165a182
>       avgbw = 0
169a187
>           avgbw = avgbw+n*bytes(j)*nbpw/mintime(j)/1.0D6
171,173c189,193
<       sum(1) = 0.0d0
<       sum(2) = 0.0d0
<       sum(3) = 0.0d0
---
>       WRITE (*,FMT=9050) "AvgBW:     ", avgbw/4D0
>       sum1 = 0.0d0
>       sum2 = 0.0d0
>       sum3 = 0.0d0
> c$omp parallel do
175,177c195,197
<           sum(1) = sum(1) + a(j)
<           sum(2) = sum(2) + b(j)
<           sum(3) = sum(3) + c(j)
---
>           sum1 = sum1 + a(j)
>           sum2 = sum2 + b(j)
>           sum3 = sum3 + c(j)
179,181c199,201
<       PRINT *,'Sum of a is = ',sum(1)
<       PRINT *,'Sum of b is = ',sum(2)
<       PRINT *,'Sum of c is = ',sum(3)
---
>       PRINT *,'Sum of a is = ',sum1
>       PRINT *,'Sum of b is = ',sum2
>       PRINT *,'Sum of c is = ',sum3
185c205
<  9020 FORMAT (1x,a,i4,a)
---
>  9020 FORMAT (1x,a,f10.2,a)
334a355,374
>       END
> 
>       SUBROUTINE defend_wrap
>       INTEGER count, count_rate, count_max
>       CALL SYSTEM_CLOCK ( count, count_rate, count_max )
>       IF (DBLE(count) .GT. .999*DBLE(count_max)) THEN
>          PRINT *,"Oops, this code won't handle a wrapping system_clock"
>          PRINT *,"and soon we will wrap."
>          PRINT 4, "count:", count, "count_max:", count_max
>   4      FORMAT (1X, A10, I16)
>          PRINT *,"Try again later, or fix the code to handle wraps."
>          PRINT *,"(The counter wraps approx once every 60 hours)"
>          STOP
>       END IF
>       END
> 
>       DOUBLE PRECISION FUNCTION second
>       INTEGER count, count_rate, count_max
>       CALL SYSTEM_CLOCK ( count, count_rate, count_max )
>       second = DBLE(count)/DBLE(count_rate)
% cat !$
% cat mcc_omp.f
* this version 25-May-2000 j. henning compaq
* Arrays are ALLOCATABLE; OpenMP syntax has been simplified (per
* suggestion from John McCalpin); timer is Fortran-90 SYSTEM_CLOCK
* Program: Stream
* Programmer: John D. McCalpin
* Revision: 4.1, June 4, 1996
*
* This program measures memory transfer rates in MB/s for simple
* computational kernels coded in Fortran.  These numbers reveal the
* quality of code generation for simple uncacheable kernels as well
* as showing the cost of floating-point operations relative to memory
* accesses.
*
*=========================================================================
* INSTRUCTIONS:
*       1) Stream requires a cpu timing function called second().
*          A sample is shown below.  This is unfortunately rather
*          system dependent.  The code attempts to determine the
*          granularity of the clock to help interpret the results.
*          For dedicated or parallel runs, you might want to comment
*          these out and compile/link with "wallclock.c".
*       2) Stream requires a good bit of memory to run.
*          Adjust the Parameter 'N' in the main program to give
*          a 'timing calibration' of at least 20 clicks.
*          This will provide rate estimates that should be good to
*          about 5% precision.
*          ------------------------------------------------------------
*          Note that you are free to use any array length and offset
*          that makes each array larger than the last-level cache.
*          The intent is to determine the *best* sustainable bandwidth
*          available with this simple coding.  Of course, lower values
*          are usually fairly easy to obtain on cached machines, but 
*          by keeping the test to the *best* results, the answers are
*          easier to interpret.
*          You may put the arrays in common or not, at your discretion.
*          There is a commented-out COMMON statement below.
*          ------------------------------------------------------------
*       3) Compile the code with full optimization.  Many compilers
*          generate unreasonably bad code before the optimizer tightens
*          things up.  If the results are unreasonably good, on the
*          other hand, the optimizer might be too smart for me
*          Please let me know if this happens.
*       4) Mail the results to mccalpin@cs.virginia.edu
*          Be sure to include:
*               a) computer hardware model number and software revision
*               b) the compiler flags
*               c) all of the output from the test case.
*
* Thanks
*=========================================================================
*
      PROGRAM stream
*     IMPLICIT NONE
C     .. Parameters ..
      INTEGER*8 n,offset,ndim,ntimes,maxtimes
      PARAMETER (maxtimes=10000)
C     ..
C     .. Local Scalars ..
      DOUBLE PRECISION dummy,scalar,t
      INTEGER*8 j,k,nbpw,quantum
C     ..
C     .. Local Arrays ..
      DOUBLE PRECISION maxtime(4),mintime(4),rmstime(4),
     $                 sum1,sum2,sum3,times(4,maxtimes),avgbw
      INTEGER*8 bytes(4)
      CHARACTER label(4)*11
C     ..
C     .. External Functions ..
      DOUBLE PRECISION second
      INTEGER checktick,realsize
      EXTERNAL second,checktick,realsize
C     ..
C     .. Intrinsic Functions ..
C
      INTRINSIC dble,max,min,nint,sqrt
C     ..
C     .. Arrays in Common ..
      REAL*8, ALLOCATABLE:: a(:),b(:),c(:)
C     ..
C     .. Common blocks ..
*     COMMON a,b,c
C     ..
C     .. Data statements ..
      DATA rmstime/4*0.0D0/,mintime/4*1.0D+36/,maxtime/4*0.0D0/
      DATA label/'Copy:      ','Scale:     ','Add:       ',
     $     'Triad:     '/
      DATA bytes/2,2,3,3/,dummy/0.0d0/
C     ..
*       --- SETUP --- determine precision and check timing ---
      PRINT *, "n, offset, ntimes"
      READ *, n, offset, ntimes
      ndim=n+offset
      IF (ntimes .GT. maxtimes) ntimes=maxtimes
      ALLOCATE (a(ndim), b(ndim), c(ndim))
      CALL defend_wrap
      nbpw = realsize()
      WRITE (*,FMT=9010) 'Array size = ',n
      WRITE (*,FMT=9010) 'Offset     = ',offset
      WRITE (*,FMT=9020) 'The total memory requirement is ',
     $  3D0*nbpw*DBLE(n)/ (1024d0*1024),' MB'
      WRITE (*,FMT=9030) 'You are running each test ',ntimes,' times'
      WRITE (*,FMT=9030) 'The *best* time for each test is used'
c$omp parallel do
      DO 10 j = 1,n
          a(j) = 1.0d0
          b(j) = 2.0D0
          c(j) = 0.0D0
   10 CONTINUE
      t = second(dummy)
c$omp parallel do
      DO 20 j = 1,n
          a(j) = 2.0d0*a(j)
   20 CONTINUE
      t = second(dummy) - t
      PRINT *,'----------------------------------------------------'
      quantum = checktick()
      WRITE (*,FMT=9000)
     $  'Your clock granularity/precision appears to be ',quantum,
     $  ' microseconds'
      PRINT *,'The tests below will each take a time on the order '
      PRINT *,'of ',nint(t*1d6),' microseconds'
      PRINT *,'   (= ',nint((t*1d6)/quantum),' clock ticks)'
      PRINT *,'Increase the size of the arrays if this shows that'
      PRINT *,'you are not getting at least 20 clock ticks per test.'
      PRINT *,'----------------------------------------------------'
      PRINT *,'WARNING -- The above is only a rough guideline.'
      PRINT *,'For best results, please be sure you know the'
      PRINT *,'precision of your system timer.'
      PRINT *,'----------------------------------------------------'
*       --- MAIN LOOP --- repeat test cases NTIMES times ---
      scalar = 1.5d0*a(1)
      DO 70 k = 1,ntimes
          t = second(dummy)
c$omp parallel do
          DO 30 j = 1,n
              c(j) = a(j)
   30     CONTINUE
          t = second(dummy) - t
          times(1,k) = t
          t = second(dummy)
c$omp parallel do
          DO 40 j = 1,n
              b(j) = scalar*c(j)
   40     CONTINUE
          t = second(dummy) - t
          times(2,k) = t
          t = second(dummy)
c$omp parallel do
          DO 50 j = 1,n
              c(j) = a(j) + b(j)
   50     CONTINUE
          t = second(dummy) - t
          times(3,k) = t
          t = second(dummy)
c$omp parallel do
          DO 60 j = 1,n
              a(j) = b(j) + scalar*c(j)
   60     CONTINUE
          t = second(dummy) - t
          times(4,k) = t
   70 CONTINUE
*       --- SUMMARY ---
      DO 90 k = 1,ntimes
          DO 80 j = 1,4
              rmstime(j) = rmstime(j) + times(j,k)**2
              mintime(j) = min(mintime(j),times(j,k))
              maxtime(j) = max(maxtime(j),times(j,k))
   80     CONTINUE
   90 CONTINUE
      WRITE (*,FMT=9040)
      avgbw = 0
      DO 100 j = 1,4
          rmstime(j) = sqrt(rmstime(j)/dble(ntimes))
          WRITE (*,FMT=9050) label(j),n*bytes(j)*nbpw/mintime(j)/1.0D6,
     $      rmstime(j),mintime(j),maxtime(j)
          avgbw = avgbw+n*bytes(j)*nbpw/mintime(j)/1.0D6
  100 CONTINUE
      WRITE (*,FMT=9050) "AvgBW:     ", avgbw/4D0
      sum1 = 0.0d0
      sum2 = 0.0d0
      sum3 = 0.0d0
c$omp parallel do
      DO 110 j = 1,n
          sum1 = sum1 + a(j)
          sum2 = sum2 + b(j)
          sum3 = sum3 + c(j)
  110 CONTINUE
      PRINT *,'Sum of a is = ',sum1
      PRINT *,'Sum of b is = ',sum2
      PRINT *,'Sum of c is = ',sum3
 9000 FORMAT (1x,a,i6,a)
 9010 FORMAT (1x,a,i10)
 9020 FORMAT (1x,a,f10.2,a)
 9030 FORMAT (1x,a,i3,a,a)
 9040 FORMAT ('Function',5x,'Rate (MB/s)  RMS time   Min time  Max time'
     $       )
 9050 FORMAT (a,4 (f10.4,2x))
      END
*-------------------------------------
* INTEGER FUNCTION dblesize()
*
* A semi-portable way to determine the precision of DOUBLE PRECISION
* in Fortran.
* Here used to guess how many bytes of storage a DOUBLE PRECISION
* number occupies.
*
      INTEGER FUNCTION realsize()
*     IMPLICIT NONE
C     .. Local Scalars ..
      DOUBLE PRECISION result,test
      INTEGER j,ndigits
C     ..
C     .. Local Arrays ..
      DOUBLE PRECISION ref(30)
C     ..
C     .. External Subroutines ..
      EXTERNAL confuse
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC abs,acos,log10,sqrt
C     ..
C       Test #1 - compare single(1.0d0+delta) to 1.0d0
   10 DO 20 j = 1,30
          ref(j) = 1.0d0 + 10.0d0** (-j)
   20 CONTINUE
      DO 30 j = 1,30
          test = ref(j)
          ndigits = j
          CALL confuse(test,result)
          IF (test.EQ.1.0D0) THEN
              GO TO 40
          END IF
   30 CONTINUE
      GO TO 50
   40 WRITE (*,FMT='(a)')
     $  '----------------------------------------------'
      WRITE (*,FMT='(1x,a,i2,a)') 'Double precision appears to have ',
     $  ndigits,' digits of accuracy'
      IF (ndigits.LE.8) THEN
          realsize = 4
      ELSE
          realsize = 8
      END IF
      WRITE (*,FMT='(1x,a,i1,a)') 'Assuming ',realsize,
     $  ' bytes per DOUBLE PRECISION word'
      WRITE (*,FMT='(a)')
     $  '----------------------------------------------'
      RETURN
   50 PRINT *,'Hmmmm.  I am unable to determine the size.'
      PRINT *,'Please enter the number of Bytes per DOUBLE PRECISION',
     $  ' number : '
      READ (*,FMT=*) realsize
      IF (realsize.NE.4 .AND. realsize.NE.8) THEN
          PRINT *,'Your answer ',realsize,' does not make sense.'
          PRINT *,'Try again.'
          PRINT *,'Please enter the number of Bytes per ',
     $      'DOUBLE PRECISION number : '
          READ (*,FMT=*) realsize
      END IF
      PRINT *,'You have manually entered a size of ',realsize,
     $  ' bytes per DOUBLE PRECISION number'
      WRITE (*,FMT='(a)')
     $  '----------------------------------------------'
      END
      SUBROUTINE confuse(q,r)
*     IMPLICIT NONE
C     .. Scalar Arguments ..
      DOUBLE PRECISION q,r
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC cos
C     ..
      r = cos(q)
      RETURN
      END
* A semi-portable way to determine the clock granularity
* Adapted from a code by John Henning of Digital Equipment Corporation
*
      INTEGER FUNCTION checktick()
*     IMPLICIT NONE
C     .. Parameters ..
      INTEGER n
      PARAMETER (n=20)
C     ..
C     .. Local Scalars ..
      DOUBLE PRECISION dummy,t1,t2
      INTEGER i,j,jmin
C     ..
C     .. Local Arrays ..
      DOUBLE PRECISION timesfound(n)
C     ..
C     .. External Functions ..
      DOUBLE PRECISION second
      EXTERNAL second
C     ..
C     .. Intrinsic Functions ..
      INTRINSIC max,min,nint
C     ..
      i = 0
      dummy = 0.0d0
      t1 = second(dummy)
   10 t2 = second(dummy)
      IF (t2.EQ.t1) GO TO 10
      t1 = t2
      i = i + 1
      timesfound(i) = t1
      IF (i.LT.n) GO TO 10
      jmin = 1000000
      DO 20 i = 2,n
          j = nint((timesfound(i)-timesfound(i-1))*1d6)
          jmin = min(jmin,max(j,0))
   20 CONTINUE
      IF (jmin.GT.0) THEN
          checktick = jmin
      ELSE
          PRINT *,'Your clock granularity appears to be less ',
     $      'than one microsecond'
          checktick = 1
      END IF
      RETURN
*      PRINT 14, timesfound(1)*1d6
*      DO 20 i=2,n
*         PRINT 14, timesfound(i)*1d6,
*     &       nint((timesfound(i)-timesfound(i-1))*1d6)
*   14    FORMAT (1X, F18.4, 1X, i8)
*   20 CONTINUE
      END
      SUBROUTINE defend_wrap
      INTEGER count, count_rate, count_max
      CALL SYSTEM_CLOCK ( count, count_rate, count_max )
      IF (DBLE(count) .GT. .999*DBLE(count_max)) THEN
         PRINT *,"Oops, this code won't handle a wrapping system_clock"
         PRINT *,"and soon we will wrap."
         PRINT 4, "count:", count, "count_max:", count_max
  4      FORMAT (1X, A10, I16)
         PRINT *,"Try again later, or fix the code to handle wraps."
         PRINT *,"(The counter wraps approx once every 60 hours)"
         STOP
      END IF
      END
      DOUBLE PRECISION FUNCTION second
      INTEGER count, count_rate, count_max
      CALL SYSTEM_CLOCK ( count, count_rate, count_max )
      second = DBLE(count)/DBLE(count_rate)
      END
% cat buildit_noomp.csh
cat buildit_noomp.csh
#!/bin/csh
set verbose
unlimit
f90 -v -source_listing -machine_code \
  -o mcc_noomp_`date +%Y%m%d` \
  -fast -O5 -unroll 32 -arch ev6 \
  mcc_omp.f 
grep COMPILER: mcc_omp.lis
% 
% ./!$
% ./buildit_noomp.csh
unlimit 
f90 -v -source_listing -machine_code -o mcc_noomp_`date +%Y%m%d` -fast -O5 -unroll 32 -arch ev6 mcc_omp.f 
/usr/lib/cmplrs/fort90/decfort90 -machine_code -fast -O5 -unroll 32 -arch ev6 -I/usr/lib/cmplrs/hpfrtl -source_listing -o /tmp/forAAAabcuna.o mcc_omp.f 
/usr/bin/cc -v -o mcc_noomp_20000627 -arch ev6 /usr/lib/cmplrs/fort90/for_main.o -source_listing /tmp/forAAAabcuna.o -O4 -qlshpf -lUfor -lfor -lFutil -lm -lots -lm_c32 
/usr/lib/cmplrs/cc/ld -o mcc_noomp_20000627 -g0 -O4 -call_shared /usr/lib/cmplrs/cc/crt0.o /usr/lib/cmplrs/fort90/for_main.o /tmp/forAAAabcuna.o -qlshpf -lUfor -lfor -lFutil -lm -lots -lm_c32 -lc 
/usr/lib/cmplrs/cc/ld: 
0.01u 0.01s 0:00 40% 0+19k 0+12io 0pf+0w 19stk+2008mem
grep COMPILER: mcc_omp.lis 
COMPILER: Compaq Fortran V5.3-915-449BB
% 
% ls
buildit_noomp.csh                  mcc_omp.lis
mcc_noomp_20000627                 stream_d.f_as_at_ftp_site_22may97
mcc_omp.f                          typescript
% ./mcc_noomp_20000627
 n, offset, ntimes
1008075,0,10
----------------------------------------------
 Double precision appears to have 16 digits of accuracy
 Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
 Array size =    1008075
 Offset     =          0
 The total memory requirement is      23.07 MB
 You are running each test  10 times
 The *best* time for each test is used
 ----------------------------------------------------
 Your clock granularity/precision appears to be    100 microseconds
 The tests below will each take a time on the order 
 of        18100  microseconds
    (=          181  clock ticks)
 Increase the size of the arrays if this shows that
 you are not getting at least 20 clock ticks per test.
 ----------------------------------------------------
 WARNING -- The above is only a rough guideline.
 For best results, please be sure you know the
 precision of your system timer.
 ----------------------------------------------------
Function     Rate (MB/s)  RMS time   Min time  Max time
Copy:        806.4600      0.0202      0.0200      0.0208
Scale:       798.4752      0.0203      0.0202      0.0206
Add:         763.2114      0.0318      0.0317      0.0320
Triad:       777.9357      0.0313      0.0311      0.0317
AvgBW:       786.5206
 Sum of a is =   1.162613685089136E+018
 Sum of b is =   2.325227370154242E+017
 Sum of c is =   3.100303160218152E+017
% exit
% 
script done on Tue Jun 27 05:01:42 2000
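
As a cross-check on the run above, the following Python sketch recomputes the reported figures from the formulas in the WRITE statements of mcc_omp.f. The function names are made up for this illustration. Two inputs to the wrap estimate are assumptions rather than values shown in the transcript: `COUNT_MAX` is taken to be 2**31 - 1 (a default 32-bit INTEGER), and `COUNT_RATE` is taken to be 10000 ticks per second, inferred from the reported 100 microsecond granularity.

```python
# Values copied from the run above; formulas follow mcc_omp.f.
N = 1008075          # array length entered at the prompt
WORD_BYTES = 8       # bytes per DOUBLE PRECISION word, per realsize()
WORDS_MOVED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}  # DATA bytes/2,2,3,3/
REPORTED_RATE = {    # "Rate (MB/s)" column of the output
    "Copy": 806.4600,
    "Scale": 798.4752,
    "Add": 763.2114,
    "Triad": 777.9357,
}

# ASSUMPTIONS, not shown in the transcript:
COUNT_MAX = 2**31 - 1   # largest default 32-bit INTEGER
COUNT_RATE = 10_000     # ticks per second, from the 100 microsecond granularity


def memory_mb(n, word_bytes=WORD_BYTES):
    """Total size of the three arrays in MB (1 MB = 1024*1024 bytes)."""
    return 3.0 * word_bytes * n / (1024.0 * 1024.0)


def rate_mb_per_s(n, words, seconds, word_bytes=WORD_BYTES):
    """Bandwidth as the benchmark reports it (1 MB = 1.0e6 bytes)."""
    return n * words * word_bytes / seconds / 1.0e6


def implied_min_time(n, words, rate, word_bytes=WORD_BYTES):
    """Invert rate_mb_per_s to recover the minimum time in seconds."""
    return n * words * word_bytes / (rate * 1.0e6)


avg_bw = sum(REPORTED_RATE.values()) / 4.0
wrap_hours = COUNT_MAX / COUNT_RATE / 3600.0

if __name__ == "__main__":
    print(f"memory requirement: {memory_mb(N):.2f} MB")
    for name, rate in REPORTED_RATE.items():
        t = implied_min_time(N, WORDS_MOVED[name], rate)
        print(f"{name:6s} implied min time: {t:.4f} s")
    print(f"average bandwidth: {avg_bw:.3f} MB/s")
    print(f"SYSTEM_CLOCK wrap interval: {wrap_hours:.1f} hours")
```

The recomputed memory requirement and average bandwidth agree with the 23.07 MB and 786.5206 MB/s printed by the run, to within rounding. Under the two stated assumptions, the wrap interval comes out just under 60 hours, consistent with the message in defend_wrap.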


This archive was generated by hypermail 2b29 : Mon Jul 17 2000 - 04:46:07 CDT