STREAM / STREAM2 results and bugs

From: Anglade Pierre-Matthieu
Date: Thu Feb 19 2009 - 06:52:35 CST

Dear Sir,

I have enjoyed using your STREAM benchmarking software and, in return,
I would first like to suggest two minor changes to your code.
1) At the beginning of mysecond.c I noticed the following piece of
code: """
#ifdef UNDERSCORE
double mysecond_()
#else
double mysecond()
#endif"""
I would suggest the following, which is simpler from the user's point
of view: """
double mysecond();
double mysecond_(){return mysecond();}"""
This would avoid having to define the UNDERSCORE macro at compile time.
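
A minimal sketch of what the suggested layout could look like as a complete file. The body of mysecond() is an assumption here (a plain gettimeofday() wall-clock timer); only the two entry-point names come from the message above. Both symbols are always defined, so a caller linked from Fortran finds whichever name its compiler generates, with no macro needed.

```c
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in seconds, read with POSIX gettimeofday().
   The body is an assumed implementation, not quoted from mysecond.c. */
double mysecond(void)
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + (double)tp.tv_usec * 1.e-6;
}

/* Second entry point for Fortran compilers that append an
   underscore to external names; it simply forwards the call. */
double mysecond_(void)
{
    return mysecond();
}
```

The cost of the wrapper is one extra function call per timer read, which should be negligible next to the kernels being timed.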

2) Running the OpenMP version, I noticed that the average time is
always smaller than the minimum time. Although this is not really
a problem, it is still quite surprising, and I would suggest replacing
line 243 in stream_omp.c: """
avgtime[j] = avgtime[j]/(double)NTIMES;"""
by : """
avgtime[j] = avgtime[j]/(double)(NTIMES-1);"""
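
The proposed divisor makes sense if the summary loop skips the first trial, so that only NTIMES-1 samples are summed. The sketch below illustrates this under that assumption; summarize() and its signature are hypothetical and not taken from stream_omp.c. With ten identical timings of 1.0 s, summing nine of them and dividing by NTIMES would report an average of 0.9 s, below the minimum of 1.0 s, whereas dividing by NTIMES-1 gives 1.0 s.

```c
#include <float.h>

#define NTIMES 10

/* Hypothetical helper: summarise the timings of one kernel.
   Trial 0 is assumed to be a warm-up run and is excluded. */
void summarize(const double times[NTIMES],
               double *avg, double *min, double *max)
{
    double sum = 0.0;
    int k;

    *min = DBL_MAX;
    *max = 0.0;
    for (k = 1; k < NTIMES; k++) {
        sum += times[k];
        if (times[k] < *min) *min = times[k];
        if (times[k] > *max) *max = times[k];
    }
    /* NTIMES-1 samples were summed, so divide by NTIMES-1. */
    *avg = sum / (double)(NTIMES - 1);
}
```

With this divisor the average always lies between the reported minimum and maximum.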

From the website I have seen that you are collecting benchmark results
on various systems. In case it is of any interest to you, I am attaching
the results obtained on my own machines.
Both computers run Fedora Core 10 and use gcc 4.3.2-7. On both
computers I used the following compilation flags:
-march=native -O3 -fomit-frame-pointer and, except for the MPI runs, -static.
The computer hereafter referred to as nocona contains a 2.66 GHz Intel
Core 2 Duo E6750 on an Intel G35 chipset with 2 GB of DDR2 SDRAM running at
The computer hereafter referred to as barcelona contains two 2.1 GHz AMD
Barcelona 2352 processors on an NVIDIA nForce 3600 chipset with 8x1 GB of
DDR2 SDRAM running at 800 MHz.
The attached tar.gz file contains test results in files labeled as
follows:
"number of threads/processes"-"number of mega doubles"-"computer"-"parallel lib"
where:
the number of threads/processes is single, two, four or eight;
the number of mega doubles is 2, 8 or 32;
the computer is nocona or barcelona;
the parallel lib is either nothing, omp for OpenMP threading, or mpi
for Open MPI multiprocess runs.
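
As an illustration of the naming scheme, a label could be split into its fields roughly as follows. This is only a sketch: parse_label() and struct run_label are hypothetical names, and it assumes that the fields are separated by a plain "-" and that the last field is simply absent for serial runs (for example "two-8-barcelona-omp" or "single-2-nocona"), which the message does not state explicitly.

```c
#include <stdio.h>

/* Hypothetical container for the four fields of a result label. */
struct run_label {
    char procs[16];      /* single, two, four or eight        */
    int  mega_doubles;   /* 2, 8 or 32                        */
    char computer[16];   /* nocona or barcelona               */
    char lib[16];        /* "" (serial), omp or mpi           */
};

/* Returns 0 on success, -1 if fewer than three fields are found. */
int parse_label(const char *name, struct run_label *out)
{
    int n;

    out->lib[0] = '\0';  /* stays empty when no parallel lib is given */
    n = sscanf(name, "%15[^-]-%d-%15[^-]-%15s",
               out->procs, &out->mega_doubles, out->computer, out->lib);
    return (n >= 3) ? 0 : -1;
}
```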
I am also attaching the results obtained for the single-threaded STREAM2
benchmark, both as two separate files and as a PostScript plot. I find
them a bit strange given the discrepancy between barcelona's performance
in the cached DAXPY benchmark and in the COPY and SUM benchmarks.
In your experience, could this be a compiler optimization problem?

With many thanks and best regards.

Pierre-Matthieu Anglade

Received on Thu Feb 19 07:31:19 2009
