7100/66 (Mac) Stream information & source.

From: markmi@mdhost.cse.TEK.COM
Date: Sat Mar 09 1996 - 20:51:40 CST


The following includes results and source changes to
support use on Macintoshes. Some source changes are
required for ANSI C portability: FLT_MAX need not be
a constant expression, and it is not one in the Mac compilers.
Also, I've made the wall clock case check for __STDC__
and use ANSI facilities (clock()). Its wall vs. cpu
placement may be questionable; neither place may be fully
portable. As the Mac has no process-specific time
available, I've done nothing with the cpu time source.
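
To illustrate the FLT_MAX point, here is a minimal sketch of the run-time initialization, assuming only <float.h>. NKERNELS and init_mintime() are hypothetical names used for illustration; the posted source hard-codes 4 and assigns the elements inline in main().

```c
#include <float.h>

#define NKERNELS 4  /* hypothetical name; the posted source hard-codes 4 */

/* No initializer here: DBL_MAX may not be a constant expression
 * on every compiler, so a static initializer might be rejected. */
static double mintime[NKERNELS];

/* Fill the array at run time instead, before the first MIN() update. */
static void init_mintime(void)
{
    int j;

    for (j = 0; j < NKERNELS; j++)
        mintime[j] = DBL_MAX;
}
```

Calling init_mintime() once at the top of main() would have the same effect as the four assignments in the source below.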

Some day I may get Mac 9500 @132MHz numbers.

Have fun.
m_m

-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 73 microseconds.
Each test below will take on the order of 657952 microseconds.
   (= 9013 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function   Rate (MB/s)    RMS time    Min time    Max time
Assignment:    33.4510      0.9576      0.9566      0.9585
Scaling   :    33.5360      0.9605      0.9542      0.9939
Summing   :    37.0248      1.3055      1.2964      1.3654
SAXPYing  :    36.9258      1.3036      1.2999      1.3122

7100/66, System 7.5.1

Metrowerks CodeWarrior DR8 (IDE 1.5b2)

Compiler options:

Make Strings ReadOnly
Store Static Data in TOC
Use FMADD & FMSUB
601 Scheduling
Peephole
Global Optimization level 4

Macintosh-specific Microseconds() measurements.
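
For readers checking the tables, the Rate column can be reproduced from the Min time column. The sketch below mirrors the bytes[] table and the final printf in the source; stream_rate_mb_per_s() is a hypothetical helper, not part of the posted program, and it assumes 8-byte doubles as reported in the output.

```c
/* Rate in MB/s (10^6 bytes per second) for one kernel.
 * arrays: 2 for Assignment and Scaling (one read, one write),
 *         3 for Summing and SAXPYing (two reads, one write).
 * n: array length; min_seconds: best time over all repetitions. */
static double stream_rate_mb_per_s(int arrays, long n, double min_seconds)
{
    double bytes = (double)arrays * (double)sizeof(double) * (double)n;

    return 1.0e-6 * bytes / min_seconds;
}
```

With arrays = 2, n = 2000000 and a Min time of 0.9566 this gives roughly 33.45 MB/s, which agrees with the Assignment line above to within the rounding of the printed time.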

-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 16666 microseconds.
Each test below will take on the order of 599999 microseconds.
   (= 36 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function   Rate (MB/s)    RMS time    Min time    Max time
Assignment:    35.5556      0.9117      0.9000      0.9167
Scaling   :    34.9091      0.9167      0.9167      0.9167
Summing   :    39.4521      1.2317      1.2167      1.2500
SAXPYing  :    39.4521      1.2300      1.2167      1.2333

7100/66, System 7.5.1

Metrowerks CodeWarrior DR8 (IDE 1.5b2)

Compiler options:

Make Strings ReadOnly
Store Static Data in TOC
Use FMADD & FMSUB
601 Scheduling
Peephole
Global Optimization level 4

ANSI C "clock()" measurements.
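
The clock() run above shows why the timer matters: at 16666 microseconds per tick (1/60th of a second) a test of about 599999 microseconds spans only 36 ticks, against more than 9000 for the Microseconds() run. A minimal sketch of that arithmetic follows; ticks_per_test() and enough_ticks() are hypothetical helpers, and returning -1 for a granularity below one microsecond is my own convention, not something the posted program does.

```c
/* Number of whole clock ticks covered by one test.
 * test_usec: estimated test duration in microseconds.
 * quantum_usec: estimated clock granularity in microseconds. */
static int ticks_per_test(double test_usec, int quantum_usec)
{
    if (quantum_usec <= 0)
        return -1;  /* granularity below 1 us: tick count not meaningful */
    return (int)(test_usec / quantum_usec);
}

/* The program's guideline: at least 20 ticks per test. */
static int enough_ticks(double test_usec, int quantum_usec)
{
    return ticks_per_test(test_usec, quantum_usec) >= 20;
}
```

Both runs pass the 20-tick guideline, but the clock() run does so with little margin, which fits the coarser, repeated time values in its table.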

-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 1000000, Offset = 0
Total memory required = 22.9 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 78 microseconds.
Each test below will take on the order of 329707 microseconds.
   (= 4227 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function   Rate (MB/s)    RMS time    Min time    Max time
Assignment:    33.5670      0.4778      0.4767      0.4786
Scaling   :    33.3568      0.4811      0.4797      0.4835
Summing   :    36.8963      0.6522      0.6505      0.6538
SAXPYing  :    36.7813      0.6569      0.6525      0.6628

7100/66, System 7.5.1

Metrowerks CodeWarrior DR8 (IDE 1.5b2)

Compiler options:

Make Strings ReadOnly
Store Static Data in TOC
Use FMADD & FMSUB
601 Scheduling
Peephole
Global Optimization level 4

Macintosh-specific Microseconds() measurements.

-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 100000, Offset = 0
Total memory required = 2.3 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 74 microseconds.
Each test below will take on the order of 29806 microseconds.
   (= 402 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function   Rate (MB/s)    RMS time    Min time    Max time
Assignment:    35.7039      0.0462      0.0448      0.0496
Scaling   :    35.5848      0.0465      0.0450      0.0487
Summing   :    39.2247      0.0639      0.0612      0.0699
SAXPYing  :    39.1805      0.0637      0.0613      0.0697

7100/66, System 7.5.1

Symantec C 8.0.4/MrC beta compiler

WARNING: Even with the small (100,000) size, the compiler
         generates a program requiring 36MB of RAM allocated
         to it to run. The compiler also needed ~50MB RAM to
         compile and link it.
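
For comparison with the 36MB figure, the program's own "Total memory required" line counts only the three static arrays. A minimal sketch of that formula, using a hypothetical helper stream_array_mb() and ignoring OFFSET just as the printf in the source does:

```c
/* Memory taken by the three arrays a[], b[], c[], in units of
 * 2^20 bytes, following the program's (3 * N * BytesPerWord) / 1048576.0. */
static double stream_array_mb(long n)
{
    return (3.0 * (double)n * (double)sizeof(double)) / 1048576.0;
}
```

With 8-byte doubles, n = 100000 gives about 2.3 and n = 2000000 about 45.8, matching the output blocks in this post. The arrays alone therefore account for only a small part of the 36MB quoted in the warning.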

Compiler options:

Use global optimization
Optimize for speed
Loop Unrolling
Copy Propagation
Interproc Optimization
Enable Inlining
Inline Level 5

Macintosh-specific Microseconds() measurements.

-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 75 microseconds.
Each test below will take on the order of 592060 microseconds.
   (= 7894 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function   Rate (MB/s)    RMS time    Min time    Max time
Assignment:    35.3272      0.9075      0.9058      0.9086
Scaling   :    35.1498      0.9120      0.9104      0.9129
Summing   :    39.1263      1.2294      1.2268      1.2309
SAXPYing  :    39.1204      1.2290      1.2270      1.2310

7100/66, System 7.5.1

mcc DR2.0 (Motorola) with KAP

compiler options: -w -Apascalstr=1 -minsizenums=1 -Akap -O4 -sym off

Macintosh-specific Microseconds() measurements.

-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 1000000, Offset = 0
Total memory required = 22.9 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 75 microseconds.
Each test below will take on the order of 295976 microseconds.
   (= 3946 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function   Rate (MB/s)    RMS time    Min time    Max time
Assignment:    35.4256      0.4525      0.4517      0.4531
Scaling   :    34.8912      0.4602      0.4586      0.4619
Summing   :    38.8596      0.6191      0.6176      0.6202
SAXPYing  :    38.8590      0.6197      0.6176      0.6211

7100/66, System 7.5.1

mcc DR2.0 (Motorola) with KAP

compiler options: -w -Apascalstr=1 -minsizenums=1 -Akap -O4 -sym off

Macintosh-specific Microseconds() measurements.

-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 1000000, Offset = 0
Total memory required = 22.9 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 76 microseconds.
Each test below will take on the order of 326039 microseconds.
   (= 4289 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function   Rate (MB/s)    RMS time    Min time    Max time
Assignment:    35.0080      0.4615      0.4570      0.4875
Scaling   :    34.8340      0.4698      0.4593      0.5267
Summing   :    38.7526      0.6317      0.6193      0.6876
SAXPYing  :    38.7772      0.6211      0.6189      0.6239

7100/66, System 7.5.1

Symantec C 8.0.4

Compiler options:

Use global optimizer
Optimize for time

Macintosh-specific Microseconds() measurements.

/* A Fortran-callable gettimeofday routine to give access
   to the wall clock timer.

   This subroutine may need to be modified slightly to get
   it to link with Fortran on your computer.
*/

#if ((defined(__MWERKS__) || defined(__SC__) || defined(THINK_C)) && defined UNIVERSAL_INTERFACES_VERSION) || defined(__MOTO__)
#include <timer.h>
#include <limits.h>
double second()
{
        UnsignedWide usec; /* microseconds; system wide, not per process */
        
        Microseconds(&usec);
        /* Expensive, but clock() on Macs only resolves ~1/60th of a second. */
        return 1.0e-6 * (usec.hi * ((double)ULONG_MAX + 1) + usec.lo);
}

#elif __STDC__ /* for ANSI C */
#include <time.h>
double second()
{
        return ( (double) clock() / (double) CLOCKS_PER_SEC ); /* ANSI C std */
}

#else /* UNIX */
#include <sys/time.h>
/* int gettimeofday(struct timeval *tp, struct timezone *tzp); */

double second()
{
/* struct timeval { long tv_sec;
            long tv_usec; };

struct timezone { int tz_minuteswest;
             int tz_dsttime; }; */

        struct timeval tp;
        struct timezone tzp;
        int i;

        i = gettimeofday(&tp,&tzp);
        return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}
#endif

# include <stdio.h>
# include <math.h>
# include <float.h>
# include <limits.h>

/*
 * Program: Stream
 * Programmer: Joe R. Zagar
 * Revision: 4.0-BETA, October 24, 1995
 * Original code developed by John D. McCalpin
 *
 * This program measures memory transfer rates in MB/s for simple
 * computational kernels coded in C. These numbers reveal the quality
 * of code generation for simple uncacheable kernels as well as showing
 * the cost of floating-point operations relative to memory accesses.
 *
 * INSTRUCTIONS:
 *
 * 1) Stream requires a good bit of memory to run. Adjust the
 * value of 'N' (below) to give a 'timing calibration' of
 * at least 20 clock-ticks. This will provide rate estimates
 * that should be good to about 5% precision.
 */

#if (defined(__SC__) || defined(THINK_C))
# define N 100000 /* Takes massive memory for integrated MrC compiler: */
                  /* both compiler/linker (~50MB) and run time (~36MB). */
                  /* sC compiler does not have the problem. */
#else
# define N 1000000
#endif
# define NTIMES 10
# define OFFSET 0

/*
 * 3) Compile the code with full optimization. Many compilers
 * generate unreasonably bad code before the optimizer tightens
 * things up. If the results are unreasonably good, on the
 * other hand, the optimizer might be too smart for me!
 *
 * Try compiling with:
 * cc -O stream_d.c second.c -o stream_d -lm
 *
 * This is known to work on Cray, SGI, IBM, and Sun machines.
 *
 *
 * 4) Mail the results to mccalpin@udel.edu
 * Be sure to include:
 * a) computer hardware model number and software revision
 * b) the compiler flags
 * c) all of the output from the test case.
 * Thanks!
 *
 */

# define HLINE "-------------------------------------------------------------\n"

# ifndef MIN
# define MIN(x,y) ((x)<(y)?(x):(y))
# endif
# ifndef MAX
# define MAX(x,y) ((x)>(y)?(x):(y))
# endif

static double a[N+OFFSET],
                b[N+OFFSET],
                c[N+OFFSET];

static double rmstime[4] = {0}, maxtime[4] = {0},
                mintime[4]; /* FLT_MAX is not guaranteed to be a constant
                             * expression by the ANSI std. Also, DBL_MAX is
                             * for double, FLT_MAX for float. So no
                             * = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX} here.
                             */

static char *label[4] = {"Assignment:", "Scaling   :",
    "Summing   :", "SAXPYing  :"};

static double bytes[4] = {
    2 * sizeof(double) * N,
    2 * sizeof(double) * N,
    3 * sizeof(double) * N,
    3 * sizeof(double) * N
    };

extern double second();

int
main()
    {
    int quantum, checktick();
    int BytesPerWord;
    register int j, k;
    double scalar, t, times[4][NTIMES];
        
    /* Here for ANSI C portability: */
    mintime[0] = DBL_MAX;
    mintime[1] = DBL_MAX;
    mintime[2] = DBL_MAX;
    mintime[3] = DBL_MAX;

    /* --- SETUP --- determine precision and check timing --- */

    printf(HLINE);
    BytesPerWord = sizeof(double);
    printf("This system uses %d bytes per DOUBLE PRECISION word.\n",
        BytesPerWord);

    printf(HLINE);
    printf("Array size = %d, Offset = %d\n" , N, OFFSET);
    printf("Total memory required = %.1f MB.\n",
        (3 * N * BytesPerWord) / 1048576.0);
    printf("Each test is run %d times, but only\n", NTIMES);
    printf("the *best* time for each is used.\n");

    /* Initialize the arrays. */

    for (j=0; j<N; j++) {
        a[j] = 1.0;
        b[j] = 2.0;
        c[j] = 0.0;
        }

    printf(HLINE);

    if ( (quantum = checktick()) >= 1)
        printf("Your clock granularity/precision appears to be "
            "%d microseconds.\n", quantum);
    else
        printf("Your clock granularity appears to be "
            "less than one microsecond.\n");

    t = second();
    for (j = 0; j < N; j++)
        a[j] = 2.0E0 * a[j];
    t = 1.0E6 * (second() - t);

    printf("Each test below will take on the order"
        " of %d microseconds.\n", (int) t );
    /* Guard against quantum == 0 (granularity below one microsecond). */
    printf("   (= %d clock ticks)\n", (int) (t / (quantum >= 1 ? quantum : 1)) );
    printf("Increase the size of the arrays if this shows that\n");
    printf("you are not getting at least 20 clock ticks per test.\n");

    printf(HLINE);

    printf("WARNING: The above is only a rough guideline.\n");
    printf("For best results, please be sure you know the\n");
    printf("precision of your system timer.\n");
    printf(HLINE);
    
    /* --- MAIN LOOP --- repeat test cases NTIMES times --- */

    scalar = 3.0;
    for (k=0; k<NTIMES; k++)
        {
        times[0][k] = second();
        for (j=0; j<N; j++)
            c[j] = a[j];
        times[0][k] = second() - times[0][k];
        
        times[1][k] = second();
        for (j=0; j<N; j++)
            b[j] = scalar*c[j];
        times[1][k] = second() - times[1][k];
        
        times[2][k] = second();
        for (j=0; j<N; j++)
            c[j] = a[j]+b[j];
        times[2][k] = second() - times[2][k];
        
        times[3][k] = second();
        for (j=0; j<N; j++)
            a[j] = b[j]+scalar*c[j];
        times[3][k] = second() - times[3][k];
        }
    
    /* --- SUMMARY --- */

    for (k=0; k<NTIMES; k++)
        {
        for (j=0; j<4; j++)
            {
            rmstime[j] = rmstime[j] + (times[j][k] * times[j][k]);
            mintime[j] = MIN(mintime[j], times[j][k]);
            maxtime[j] = MAX(maxtime[j], times[j][k]);
            }
        }
    
    printf("Function   Rate (MB/s)    RMS time    Min time    Max time\n");
    for (j=0; j<4; j++) {
        rmstime[j] = sqrt(rmstime[j]/(double)NTIMES);

        printf("%s%11.4f %11.4f %11.4f %11.4f\n", label[j],
               1.0E-06 * bytes[j]/mintime[j],
               rmstime[j],
               mintime[j],
               maxtime[j]);
    }
    return 0;
}

# define M 20

int
checktick()
    {
    int i, minDelta, Delta;
    double t1, t2, timesfound[M];

/* Collect a sequence of M unique time values from the system. */

    for (i = 0; i < M; i++) {
        t1 = second();
        while( ((t2=second()) - t1) < 1.0E-6 )
            ;
        timesfound[i] = t1 = t2;
        }

/*
 * Determine the minimum difference between these M values.
 * This result will be our estimate (in microseconds) for the
 * clock granularity.
 */

    minDelta = 1000000;
    for (i = 1; i < M; i++) {
        Delta = (int)( 1.0E6 * (timesfound[i]-timesfound[i-1]));
        minDelta = MIN(minDelta, MAX(Delta,0));
        }

    return(minDelta);
    }
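
A note on the Mac second() above: it scales the high word by (double)ULONG_MAX + 1, which equals 2^32 only where unsigned long is 32 bits wide, as it is with the compilers listed in this post. The sketch below is a hedged alternative that writes the constant out. usec_halves_to_seconds() is a hypothetical helper that takes the two 32-bit halves as plain arguments rather than an UnsignedWide, so that it can be compiled without the Mac headers.

```c
/* Convert a 64-bit microsecond count, supplied as two 32-bit halves,
 * to seconds. hi and lo are each assumed to hold a value below 2^32. */
static double usec_halves_to_seconds(unsigned long hi, unsigned long lo)
{
    /* 4294967296.0 is 2^32, written out so the result does not
     * depend on the width of unsigned long. */
    return 1.0e-6 * ((double)hi * 4294967296.0 + (double)lo);
}
```

On the Mac this would be called as usec_halves_to_seconds(usec.hi, usec.lo) after Microseconds(&usec). A double carries 53 bits of mantissa, so the sum is exact for microsecond counts below 2^53.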


