Re: Parallel Streams

From: Isom L. Crawford (isom@hydra.convex.com)
Date: Thu Mar 28 1996 - 10:41:47 CST


John D. McCalpin wrote:
>
> > What, if any, are your preferences for obtaining parallel results?
>
> I greatly prefer automatically parallelized results. If that
> is not possible, my second choice is manually parallelized
> results. Failing that, I accept aggregate results from
> the Parallel_jobs script, which is in the same directory as
> the rest of the STREAM code.

Greetings again,

For the new SPP1600, here are the times for a manually parallelized
STREAM (source and makefile attached for your perusal). I am also
curious what your policy is regarding clusters and/or message passing,
since clusters of workstations can usually keep scaling on highly
parallel codes simply by adding machines.

Each result is from a single process using shared memory (with
lightweight threads, of course). If you have any questions, please don't
hesitate to send me email. I hope these results are suitable for your web page.

Thanks,
Isom Crawford
HP Convex Technology Center

Detailed output follows (long):

================================== 1 CPU ==================================
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 28 microseconds.
Each test below will take on the order of 3223797 microseconds.
   (= 115135 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 1 threads.
Function     Rate (MB/s)    RMS time    Min time    Max time
Assignment:     123.1902      3.3539      3.2470      4.1907
Scaling :       121.1729      3.4095      3.3011      4.2600
Summing :       156.2537      3.8406      3.8399      3.8411
SAXPYing :      156.1776      3.8426      3.8418      3.8432
================================== 4 CPU ==================================
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 28 microseconds.
Each test below will take on the order of 2881791 microseconds.
   (= 102921 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 4 threads.
Function     Rate (MB/s)    RMS time    Min time    Max time
Assignment:     360.9111      1.2396      1.1083      1.4904
Scaling :       356.7705      1.3061      1.1212      1.8792
Summing :       455.8100      1.5042      1.3163      2.0646
SAXPYing :      456.0660      1.4388      1.3156      1.5524
================================== 8 CPU ==================================
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 28 microseconds.
Each test below will take on the order of 2941454 microseconds.
   (= 105051 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 8 threads.
Function     Rate (MB/s)    RMS time    Min time    Max time
Assignment:     442.1536      0.9316      0.9047      1.1391
Scaling :       437.3879      0.9670      0.9145      1.3468
Summing :       536.7600      1.1191      1.1178      1.1233
SAXPYing :      535.3591      1.1227      1.1207      1.1257
================================= 12 CPU ==================================
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 29 microseconds.
Each test below will take on the order of 2769588 microseconds.
   (= 95503 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 12 threads.
Function     Rate (MB/s)    RMS time    Min time    Max time
Assignment:     665.9309      0.6387      0.6007      0.9087
Scaling :       658.3354      0.6339      0.6076      0.8263
Summing :       807.5023      0.7441      0.7430      0.7462
SAXPYing :      803.7552      0.7479      0.7465      0.7518
================================= 16 CPU ==================================
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 28 microseconds.
Each test below will take on the order of 2776748 microseconds.
   (= 99169 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 16 threads.
Function     Rate (MB/s)    RMS time    Min time    Max time
Assignment:     878.5454      0.5988      0.4553      1.3092
Scaling :       862.2159      0.5140      0.4639      0.8341
Summing :      1068.6442      0.5624      0.5615      0.5636
SAXPYing :     1063.5205      0.5657      0.5642      0.5667
================================= 20 CPU ==================================
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 28 microseconds.
Each test below will take on the order of 2705785 microseconds.
   (= 96635 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 20 threads.
Function     Rate (MB/s)    RMS time    Min time    Max time
Assignment:    1100.3250      0.4621      0.3635      0.9708
Scaling :      1082.5879      0.4069      0.3695      0.6396
Summing :      1334.8134      0.4499      0.4495      0.4503
SAXPYing :     1322.2852      0.4543      0.4538      0.4550
================================= 24 CPU ==================================
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 28 microseconds.
Each test below will take on the order of 2668254 microseconds.
   (= 95294 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 24 threads.
Function     Rate (MB/s)    RMS time    Min time    Max time
Assignment:    1316.0965      0.3999      0.3039      0.8740
Scaling :      1283.5896      0.3509      0.3116      0.5947
Summing :      1599.8805      0.3754      0.3750      0.3759
SAXPYing :     1599.9832      0.3794      0.3750      0.3821
================================= 28 CPU ==================================
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 28 microseconds.
Each test below will take on the order of 2597999 microseconds.
   (= 92785 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 28 threads.
Function     Rate (MB/s)    RMS time    Min time    Max time
Assignment:    1563.3857      0.3356      0.2559      0.7306
Scaling :      1524.0363      0.2947      0.2625      0.4924
Summing :      1886.1104      0.3188      0.3181      0.3197
SAXPYing :     1860.2286      0.3228      0.3225      0.3230
================================= 32 CPU ==================================
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 25000000, Offset = 0
Total memory required = 572.2 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 28 microseconds.
Each test below will take on the order of 2634035 microseconds.
   (= 94072 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING: The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Spawning 32 threads.
Function     Rate (MB/s)    RMS time    Min time    Max time
Assignment:    1777.6834      0.3055      0.2250      0.6900
Scaling :      1721.8297      0.2649      0.2323      0.4614
Summing :      2158.8011      0.2791      0.2779      0.2843
SAXPYing :     2164.6038      0.2824      0.2772      0.2841
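
For anyone reading the scaling off these tables, speedup and parallel efficiency can be computed directly from the Rate column. The following is only a small sketch; the helper names speedup and efficiency are made up here and do not appear in the attached source.

```c
#include <assert.h>

/* Hypothetical helpers for reading the tables above;
   they are not part of the attached STREAM source. */

/* Speedup of an n-CPU run relative to the 1-CPU run, from reported MB/s. */
double speedup(double rate_1cpu, double rate_ncpu)
{
    assert(rate_1cpu > 0.0);
    return rate_ncpu / rate_1cpu;
}

/* Parallel efficiency: speedup divided by the number of CPUs used. */
double efficiency(double rate_1cpu, double rate_ncpu, int ncpus)
{
    assert(ncpus > 0);
    return speedup(rate_1cpu, rate_ncpu) / (double)ncpus;
}
```

With the SAXPYing rows, speedup(156.1776, 2164.6038) comes to roughly 13.9 on 32 CPUs, i.e. an efficiency of about 0.43.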

================================ SOURCE ==============================
------------------------------- streams.h ------------------------------
# define N 25000000
# define NTIMES 10
# define OFFSET 0
-------------------------------- main.c ------------------------------
# include <stdio.h>
# include <stdlib.h>
# include <math.h>
# include <float.h>
# include <limits.h>
# include <sys/time.h>

/* Include Convex parallel support library */
# include <cps.h>
# include <spp_prog_model.h>
barrier_t b1;
int total_threads, num_args, total_nodes;
spawn_sym_t cnx_sp= { CPS_ANY_NODE, 1, 1, CPS_THREAD_PARALLEL };
void kernel();
int retval;
#include "streams.h"

/*
 * Program: Stream
 * Programmer: Joe R. Zagar
 * Revision: 4.0-BETA, October 24, 1995
 * Original code developed by John D. McCalpin
 *
 * This program measures memory transfer rates in MB/s for simple
 * computational kernels coded in C. These numbers reveal the quality
 * of code generation for simple uncacheable kernels as well as showing
 * the cost of floating-point operations relative to memory accesses.
 *
 * INSTRUCTIONS:
 *
 * 1) Stream requires a good bit of memory to run. Adjust the
 * value of 'N' (below) to give a 'timing calibration' of
 * at least 20 clock-ticks. This will provide rate estimates
 * that should be good to about 5% precision.
 */

/*
 * 3) Compile the code with full optimization. Many compilers
 * generate unreasonably bad code before the optimizer tightens
 * things up. If the results are unreasonably good, on the
 * other hand, the optimizer might be too smart for me!
 *
 * Try compiling with:
 * cc -O stream_d.c second.c -o stream_d -lm
 *
 * This is known to work on Cray, SGI, IBM, and Sun machines.
 *
 *
 * 4) Mail the results to mccalpin@udel.edu
 * Be sure to include:
 * a) computer hardware model number and software revision
 * b) the compiler flags
 * c) all of the output from the test case.
 * Thanks!
 *
 */

# define HLINE "-------------------------------------------------------------\n"

# ifndef MIN
# define MIN(x,y) ((x)<(y)?(x):(y))
# endif
# ifndef MAX
# define MAX(x,y) ((x)>(y)?(x):(y))
# endif

static node_private double a[N+OFFSET],
                b[N+OFFSET],
                c[N+OFFSET];

static double rmstime[4] = {0}, maxtime[4] = {0},
                mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX};

static char *label[4] = {"Assignment:", "Scaling :",
    "Summing :", "SAXPYing :"};

static double bytes[4] = {
    2 * sizeof(double) * N,
    2 * sizeof(double) * N,
    3 * sizeof(double) * N,
    3 * sizeof(double) * N
    };

double second();
double times[4][NTIMES];

int
main(argc,argv)
int argc;
char **argv;
    {
    int quantum, checktick();
    int BytesPerWord;
    register int j, k;
    double scalar, t;

    if( argc != 2 )
    {
            fprintf(stderr,"Usage: par_stream_d <#threads>\n");
            exit(-1);
    }

    total_threads= atoi( argv[1] );

    /* --- SETUP --- determine precision and check timing --- */

    printf(HLINE);
    BytesPerWord = sizeof(double);
    printf("This system uses %d bytes per DOUBLE PRECISION word.\n",
        BytesPerWord);

    printf(HLINE);
    printf("Array size = %d, Offset = %d\n" , N, OFFSET);
    printf("Total memory required = %.1f MB.\n",
        (3 * N * BytesPerWord) / 1048576.0);
    printf("Each test is run %d times, but only\n", NTIMES);
    printf("the *best* time for each is used.\n");

    /* Initialize the arrays; the outer loop runs once per node. */

    total_nodes= cps_complex_nodes();
# pragma _CNX loop_parallel(nodes,ivar=k)
    for (k=0; k<total_nodes; k++)
            for (j=0; j<N; j++) {
                a[j] = 1.0;
                b[j] = 2.0;
                c[j] = 0.0;
                }

    printf(HLINE);

    if ( (quantum = checktick()) >= 1)
        printf("Your clock granularity/precision appears to be "
            "%d microseconds.\n", quantum);
    else {
        printf("Your clock granularity appears to be "
            "less than one microsecond.\n");
        quantum = 1;    /* avoid dividing by zero in the tick estimate below */
        }

    t = second();
    for (j = 0; j < N; j++)
        a[j] = 2.0E0 * a[j];
    t = 1.0E6 * (second() - t);

    printf("Each test below will take on the order"
        " of %d microseconds.\n", (int) t );
    printf(" (= %d clock ticks)\n", (int) (t/quantum) );
    printf("Increase the size of the arrays if this shows that\n");
    printf("you are not getting at least 20 clock ticks per test.\n");

    printf(HLINE);

    printf("WARNING: The above is only a rough guideline.\n");
    printf("For best results, please be sure you know the\n");
    printf("precision of your system timer.\n");
    printf(HLINE);
    
    /* --- MAIN LOOP --- repeat test cases NTIMES times --- */

    alloc_barrier( &b1 );
    cnx_sp.node= CPS_ANY_NODE;
    cnx_sp.min= total_threads;
    cnx_sp.max= total_threads;
    cnx_sp.threadscope= CPS_THREAD_PARALLEL;
    num_args= 3;
    printf("Spawning %d threads.\n",total_threads);
    retval= cps_ppcalln( &cnx_sp, kernel, &num_args, a, b, c );
    if( retval < 0 ) { perror("cps_ppcalln"); exit(1); }

    /* --- SUMMARY --- */

    for (k=0; k<NTIMES; k++)
        {
        for (j=0; j<4; j++)
            {
            rmstime[j] = rmstime[j] + (times[j][k] * times[j][k]);
            mintime[j] = MIN(mintime[j], times[j][k]);
            maxtime[j] = MAX(maxtime[j], times[j][k]);
            }
        }
    
    printf("Function Rate (MB/s) RMS time Min time Max time\n");
    for (j=0; j<4; j++) {
        rmstime[j] = sqrt(rmstime[j]/(double)NTIMES);

        printf("%s%11.4f %11.4f %11.4f %11.4f\n", label[j],
               1.0E-06 * bytes[j]/mintime[j],
               rmstime[j],
               mintime[j],
               maxtime[j]);
    }
    return 0;
}

# define M 20

int
checktick()
    {
    int i, minDelta, Delta;
    double t1, t2, timesfound[M];

/* Collect a sequence of M unique time values from the system. */

    for (i = 0; i < M; i++) {
        t1 = second();
        while( ((t2=second()) - t1) < 1.0E-6 )
            ;
        timesfound[i] = t1 = t2;
        }

/*
 * Determine the minimum difference between these M values.
 * This result will be our estimate (in microseconds) for the
 * clock granularity.
 */

    minDelta = 1000000;
    for (i = 1; i < M; i++) {
        Delta = (int)( 1.0E6 * (timesfound[i]-timesfound[i-1]));
        minDelta = MIN(minDelta, MAX(Delta,0));
        }

    return(minDelta);
    }
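
The Rate and "Total memory required" figures printed by main.c above follow from simple arithmetic on N, the number of arrays each kernel touches, and the best time. A minimal sketch of that arithmetic, using the hypothetical helper names stream_rate_mbs and stream_memory_mb (neither is in the attached source), and assuming 8-byte doubles as the output reports:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers, not part of the STREAM source. */

/* MB/s as the summary loop reports it: 1e-6 * bytes moved / best time.
   arrays_touched is 2 for Assignment and Scaling,
   3 for Summing and SAXPYing (see the bytes[] table in main.c). */
double stream_rate_mbs(int arrays_touched, long n, double min_time_s)
{
    assert(min_time_s > 0.0);
    return 1.0e-6 * (double)arrays_touched * sizeof(double) * (double)n
           / min_time_s;
}

/* Memory for the three arrays, in units of 2^20 bytes as printed. */
double stream_memory_mb(long n)
{
    return 3.0 * sizeof(double) * (double)n / 1048576.0;
}
```

For the 1-CPU Assignment row, 2 * 8 * 25000000 bytes over the 3.2470 s minimum time gives about 123.19 MB/s, matching the reported value to within the rounding of the printed time; stream_memory_mb(25000000) gives about 572.2.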

------------------------------- kernel.c ------------------------------
# include <stdio.h>
# include <math.h>
# include <float.h>
# include <limits.h>
# include <sys/time.h>
# include <cps.h>
# include <spp_prog_model.h>

#include "streams.h"

extern double second();

extern double times[4][NTIMES];

extern barrier_t b1;
extern int total_threads;

void
kernel(a,b,c)
double *a, *b, *c;
    {
    double scalar;
    int thread_id, chunksize, start_indx, stop_indx;
    register int j, k;

    /* --- MAIN LOOP --- repeat test cases NTIMES times --- */
    thread_id= cps_stid();
    chunksize= N / total_threads;
    start_indx= thread_id * chunksize;
    stop_indx= (thread_id+1) * chunksize;
    if( thread_id >= (total_threads - 1)) stop_indx= N;

    scalar = 3.0;
    for (k=0; k<NTIMES; k++)
    {
        wait_barrier( &b1, &total_threads );
        if( thread_id <= 0 ) times[0][k] = second();
        for (j=start_indx; j<stop_indx; j++)
            c[j] = a[j];
        wait_barrier( &b1, &total_threads );
        if( thread_id <= 0 ) times[0][k] = second() - times[0][k];
        
        wait_barrier( &b1, &total_threads );
        if( thread_id <= 0 ) times[1][k] = second();
        for (j=start_indx; j<stop_indx; j++)
            b[j] = scalar*c[j];
        wait_barrier( &b1, &total_threads );
        if( thread_id <= 0 ) times[1][k] = second() - times[1][k];
        
        wait_barrier( &b1, &total_threads );
        if( thread_id <= 0 ) times[2][k] = second();
        for (j=start_indx; j<stop_indx; j++)
            c[j] = a[j]+b[j];
        wait_barrier( &b1, &total_threads );
        if( thread_id <= 0 ) times[2][k] = second() - times[2][k];
        
        wait_barrier( &b1, &total_threads );
        if( thread_id <= 0 ) times[3][k] = second();
        for (j=start_indx; j<stop_indx; j++)
            a[j] = b[j]+scalar*c[j];
        wait_barrier( &b1, &total_threads );
        if( thread_id <= 0 ) times[3][k] = second() - times[3][k];
    }
    
    return ;
}
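
The index arithmetic at the top of kernel() gives each thread a contiguous block of N / total_threads elements, with the last thread picking up any remainder. Pulled out on its own it might look like the sketch below; chunk_bounds is a hypothetical name used only for illustration, and the Convex cps calls are deliberately left out.

```c
#include <assert.h>

/* Hypothetical helper mirroring the index arithmetic in kernel();
   it is not part of the attached source. */
void chunk_bounds(int n, int total_threads, int thread_id,
                  int *start, int *stop)
{
    int chunksize;

    assert(total_threads > 0);
    assert(thread_id >= 0 && thread_id < total_threads);

    chunksize = n / total_threads;
    *start = thread_id * chunksize;
    *stop = (thread_id + 1) * chunksize;
    if (thread_id == total_threads - 1)
        *stop = n;      /* last thread absorbs the remainder */
}
```

For the 12-thread run, N = 25000000 does not divide evenly: threads 0 through 10 each get 2083333 elements and thread 11 gets 2083337.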
------------------------------- second.c ------------------------------
/* A Fortran-callable gettimeofday routine to give access
   to the wall clock timer.

   This subroutine may need to be modified slightly to get
   it to link with Fortran on your computer.
*/

#include <sys/time.h>
/* int gettimeofday(struct timeval *tp, struct timezone *tzp); */

double second()
{
/* struct timeval { long tv_sec;
            long tv_usec; };

struct timezone { int tz_minuteswest;
             int tz_dsttime; }; */

        struct timeval tp;
        struct timezone tzp;
        int i;

        i = gettimeofday(&tp,&tzp);
        return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}
------------------------------- makefile ------------------------------
FC= /opt/fortran/bin/f77
CC= /opt/ansic/bin/cc
OPT= +O2
FFLAGS= $(OPT)
CFLAGS= $(OPT)
LDFLAGS= -Wl,-aarchive
LD= $(FC)
INCLUDE= -I.

all: stream_d par_stream_d

stream_d: stream_d.o second.o
        $(LD) stream_d.o second.o $(LDFLAGS) -o stream_d

par_stream_d: main.o kernel.o second.o
        $(LD) main.o kernel.o second.o $(LDFLAGS) -o par_stream_d

stream_d.o: stream_d.f

second.o: second.c

main.o: main.c streams.h
        /usr/convex/bin/cc -O3 -c main.c

kernel.o: kernel.c streams.h
        $(CC) $(OPT) +Onoparmsoverlap -c kernel.c $(INCLUDE)


