of Computer Science
School of Engineering and Applied Science
University of Virginia, Charlottesville,
2016-07-28: Bug fix in stream_mpi.c
I don't like MPI and the version of STREAM written in MPI is not intended as a
"standard" version, but there is a bug in the timing code that is too big to
ignore. I very carefully derived the proper way to bound the timings for MPI
runs that would guarantee that no work would be done before the start time
and no work would be done after the end time --- then what I actually implemented
in the code was something different (and provably incorrect). Argh!
The new revision of
stream_mpi.c has the RCS version number of 1.8 and a modification
date of 2016-07-28.
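The bounding property described above can be illustrated without MPI. What follows is a minimal sketch, not the code from stream_mpi.c: it assumes every rank's start and end timestamps are already expressed on one common clock (real MPI ranks generally do not share one, which is why the real code needs more care), and the function names are hypothetical. It also shows one plausible way to get the bound wrong, which is not necessarily the bug fixed in this revision.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest interval containing every rank's work: earliest start to
 * latest end. No rank works before the start or after the end.
 * ASSUMPTION: all timestamps are on one common clock. */
double bounding_elapsed(const double *start, const double *end, size_t nranks)
{
    assert(nranks > 0);
    double first_start = start[0];
    double last_end = end[0];
    for (size_t r = 1; r < nranks; r++) {
        if (start[r] < first_start) first_start = start[r];
        if (end[r] > last_end) last_end = end[r];
    }
    return last_end - first_start;
}

/* Longest single-rank interval. When ranks are staggered this is shorter
 * than the bounding interval, so it overstates the bandwidth. */
double longest_local_elapsed(const double *start, const double *end, size_t nranks)
{
    assert(nranks > 0);
    double longest = end[0] - start[0];
    for (size_t r = 1; r < nranks; r++) {
        double local = end[r] - start[r];
        if (local > longest) longest = local;
    }
    return longest;
}
```

For two ranks that run from 0.0 to 1.0 and from 0.5 to 1.5 seconds, the bounding interval is 1.5 seconds while the longest local interval is only 1.0 second.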
2016-03-28: A bit of clean-up...
Some broken links on the main page were reported to me and I decided to clean up
a few things.
- The results in the "Macintosh-compatible" category are for obsolete
Motorola 68000 and PowerPC based systems (the newest results are now
10 years old), so I renamed the category as "Obsolete Macintosh-compatible"
results and moved it to the bottom of the page. This was just a subset of
the "standard" results anyway, so nothing has been lost.
- The "PC-compatible" results category is also mostly obsolete, and
I added a note to that effect (but did not move anything).
2016-01-26: Some older and some newer results....
ScaleMP has submitted new results on the 64-node system they have installed at
the University of Queensland. This shared-memory system based on 2-socket Xeon E5-2680v3
nodes delivers over 6 TB/s on the Add and Triad kernels, and reaches the number 3
position on the Top20 list.
Near the opposite end of the spectrum, I realized that I had not published results
from the HP Moonshot M300 system that I tested in June 2014. This system uses the
Intel C2750 processor (an 8-core Atom-based processor) and delivers up to 15 GB/s.
Even closer to the opposite end of the spectrum, results from the Intel Edison
board are included. This board contains a dual-core Atom (Silvermont) processor
running at 500 MHz and delivers a single-threaded bandwidth of almost 2 GB/s
out of what appears to be a 3.2 GB/s peak (based on my interpretation of the
Intel Edison Product Brief (pdf)).
2015-07-28: New results, including a new TOP20 leader
SGI has submitted a new record
result on the SGI UV3000 system
delivering 12.8 TB/s to 13.8 TB/s on the four STREAM kernels. This is a shared-memory
system with 256 Intel Xeon E5-4650 v3 (12 core) processors (3072 cores total). All of the
cores were activated and available to the OS, but only 1/2 of the cores (1536) were used
to run the benchmark. (It is common for multicore systems to slow down slightly when
using all cores. At extremely large system scales this effect is likely to get worse due
to background OS activity.)
At the other end of the performance spectrum, results from the Raspberry Pi 2 Model B
are now published.
Both the Fortran and C runs show relatively strong results for the
STREAM Copy kernel (1700 to 1800 MB/s -- I think that the peak memory bandwidth of the
system is 3200 MB/s, but it is hard to get details on the precise frequency used for the
LPDDR2 DRAM in the system). The kernels with arithmetic are much slower -- 500 MB/s to
950 MB/s. Presumably this is due to poor support for double-precision floating-point
arithmetic in the ARM processor core used, but again it is very hard to find this level
of detail on any ARM processor, and the details on the Broadcom BCM2836 chip used in the
Raspberry Pi 2 are particularly difficult to find.
I also finally published some results from the TACC Lonestar 4 system (based on Intel Xeon X5680
("Westmere EP") processors). Lonestar 4 is about to be replaced by a new system, but I wanted these
results to be in the archive. The results are listed under the
Dell PowerEdge M610 model name.
Another set of (not-quite-so-) old results is from a single-socket
Intel Xeon E3-1270 (Sandy Bridge)
system. Please note that these results used only 1 of the 4 cores on the system, as this
gave slightly better performance. There is nothing wrong with using fewer cores, but it does
mean that the "STREAM Balance" calculation is biased since it compares the peak performance of
only 1 core against the memory bandwidth. If all the cores were used, the "STREAM Balance" would
be a higher number (more unbalanced) by a factor of slightly more than 4 -- a factor of 4 from
counting the peak FLOPS of all the cores and a small additional increase due to the reduction
in sustained bandwidth from the extra contention caused by using all four cores.
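The factor of "slightly more than 4" can be worked through with a small sketch. It assumes that "STREAM Balance" is peak MFLOPS divided by sustained bandwidth expressed in millions of 8-byte words per second. The function name and all numbers in the usage note are hypothetical and are not taken from the submitted results.

```c
#include <assert.h>

/* ASSUMPTION: balance = peak MFLOPS / (sustained MB/s / 8 bytes per word).
 * A larger value means more peak arithmetic per word of sustained
 * memory traffic, i.e. a more unbalanced system. */
double stream_balance(double peak_mflops, double sustained_mb_per_s)
{
    assert(sustained_mb_per_s > 0.0);
    double mwords_per_s = sustained_mb_per_s / 8.0;
    return peak_mflops / mwords_per_s;
}
```

With a hypothetical per-core peak of 8000 MFLOPS and 16000 MB/s sustained on one core, the balance is 4.0. Counting all four cores (32000 MFLOPS) against a slightly lower hypothetical 15000 MB/s gives about 17.07, which is roughly 4.27 times larger: a factor of 4 from the FLOPS and the remainder from the lost bandwidth.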
These last two sets of results are now tagged with "_nta" or "_alloc" after the machine name. I am
using this as a temporary measure to indicate whether "streaming" (or "nontemporal" or "cache-bypassing")
stores were used in the run (marked by "_nta"). This is discussed in the STREAM FAQ notes on
Counting Bytes and FLOPS.
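The difference between the two tags matters because of how bytes are counted. The sketch below assumes that STREAM counts one read per source array and one write per destination array, and that a run without streaming stores pays one extra read of the destination array (a write-allocate) that is not counted. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum stream_kernel { COPY, SCALE, ADD, TRIAD };

/* Array operands per loop iteration that the benchmark counts:
 * Copy and Scale read one array and write one;
 * Add and Triad read two arrays and write one. */
static int counted_words(enum stream_kernel k)
{
    return (k == COPY || k == SCALE) ? 2 : 3;
}

/* Bytes credited to the kernel when the bandwidth is reported. */
double counted_bytes(enum stream_kernel k, size_t n, size_t word_size)
{
    assert(n > 0 && word_size > 0);
    return (double)counted_words(k) * (double)n * (double)word_size;
}

/* Bytes actually moved. ASSUMPTION: with write-allocate, the destination
 * array is read once before it is written; with streaming stores it is not. */
double memory_traffic_bytes(enum stream_kernel k, size_t n, size_t word_size,
                            int write_allocate)
{
    assert(n > 0 && word_size > 0);
    int words = counted_words(k) + (write_allocate ? 1 : 0);
    return (double)words * (double)n * (double)word_size;
}
```

Under these assumptions a Copy over 1000 doubles is credited with 16000 bytes but moves 24000 bytes when stores allocate, so the hardware is doing 1.5 times the work that the reported number reflects.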
Cleaning up my mailbox, I found a result that I had missed from October 2013. It is an "experimental"
result on an overclocked Windows system with an Intel Core i7-3820 processor -- normally 3.6 GHz, but
overclocked to 5 GHz. The memory was also overclocked from DDR3/1333 to DDR3/2000. Results are in the
screenshots attached to the submission.
2014-10-28: New Variants of STREAM released
Two variants of STREAM have been added to the
Versions subdirectory of the source
Both of these versions are fully compliant versions of STREAM (unless there are bugs).
- For shared memory systems, the standard OpenMP version of STREAM is strongly
preferred to the MPI version.
- Both of the new versions use dynamic memory allocation, which can change the
ability of the compiler to optimize the code. See the
READ.ME file in the
Versions subdirectory for more notes and
for a sample compile line for the MPI version of the code.
- The timing code for the MPI version has been carefully designed to deliver the
smallest time that is guaranteed to be at or above the true elapsed time from
the earliest start of execution of any rank to the latest end of execution of
- There is an important change in the way array size is defined in the MPI version:
- The older Fortran version of STREAM in MPI replicates the arrays on
each MPI rank.
- The new C version of STREAM in MPI distributes the arrays across
the MPI ranks.
Although it is relatively easy to get confused about sizing with either method,
I have found that with the old method it is easier to accidentally request too much
memory (which can crash some systems), while mistakes with the new method typically
request too little memory (which may not produce valid results, but does not risk
crashing anything). Your typical mistakes may differ from mine. Caveat Emptor.
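The two sizing conventions lead to very different memory requests for the same array size. The following is a rough sketch under stated assumptions: three arrays, an array size that divides evenly among the ranks, and no allowance for any other memory the program uses. The function names are hypothetical and do not appear in either version of the code.

```c
#include <assert.h>

#define NUM_ARRAYS 3ULL   /* the three STREAM arrays */

/* Replicated arrays: every rank allocates all n elements of each array,
 * so the total grows with the number of ranks. */
unsigned long long replicated_total_bytes(unsigned long long n,
                                          unsigned long long ranks,
                                          unsigned long long elem_size)
{
    assert(ranks > 0);
    return NUM_ARRAYS * n * elem_size * ranks;
}

/* Distributed arrays: n is the global size and each rank holds n / ranks
 * elements. ASSUMPTION: n is an exact multiple of ranks. */
unsigned long long distributed_total_bytes(unsigned long long n,
                                           unsigned long long ranks,
                                           unsigned long long elem_size)
{
    assert(ranks > 0 && n % ranks == 0);
    return NUM_ARRAYS * (n / ranks) * elem_size * ranks;
}
```

For 80,000,000 8-byte elements on 16 ranks, the replicated layout requests 30,720,000,000 bytes in total while the distributed layout requests 1,920,000,000 bytes. Carrying a size chosen for one convention over to the other therefore misses by a factor equal to the number of ranks.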
2014-01-28: Some fixes to the database
As I was reworking some analyses, I discovered some errors in the STREAM benchmark database that I corrected:
- Incorrect frequency for some IBM POWER5 systems (was 1800 MHz but should have been 1900 MHz).
- Incorrect frequency for the HP_AlphaServer_DS15 (was 1 MHz but should have been 1000 MHz).
- Incorrect system name for the ScaleMP_Xeon5650_64B entry.
- Incorrect name and results for SuperMicro_X8DTN+_Xeon5690 system (I copied the database entry from
a similar system, but then forgot to overwrite the values with the new ones for this submission).
In addition, the attachment containing the submitted values did not get picked up by the hypermail
program, so the results did not show up in the link to the submission, either. I added the text
to the web page for the original submission manually.
- I changed the frequency of the Opteron processors in the Sun_X4640 system from 2.2 GHz to 2.4 GHz.
The system under test used 6-core AMD "Istanbul" processors and AMD did not make a 2.2 GHz part.
The documentation that I was able to find only listed 2.4 and 2.6 GHz processors for this system.
2013-01-17: STREAM version 5.10 released!
STREAM version 5.10 has finally been released -- at least in the C language.
Fortran will follow when I get around to it....
While this version does not change what STREAM measures, it does provide
a number of long-awaited features:
- Updated Validation Code: the revised version does not suffer from
the accumulation of roundoff error for large arrays. Compiling
with the "VERBOSE" preprocessor flag causes the code to print out
the computed error even if the validation passes, and causes the
code to print the values and locations of the first 10 locations
in each of the arrays that have errors exceeding the error tolerance.
- Update array index variables from "int" to "ssize_t" to allow 64-bit
systems to use array indices that are too large to fit in a 32-bit
integer data type.
- Changed the preprocessor variable defining the array size from "N"
to "STREAM_ARRAY_SIZE" to make it easier to find in the code.
- Added a preprocessor variable "STREAM_TYPE" that allows the user
to override the default data type from "double" to "float". This
could be used to change to non-floating-point data types as well,
but some printf formats would have to be changed to match.
- Updated comments in the source code on how to configure/compile/run
the STREAM benchmark.
- Many small changes in the printed output to account for computers
getting bigger and faster:
- Array sizes now printed in "GiB" as well as "MiB".
- Output format now prints fewer decimal places for the bandwidth
but more decimal places for the min/avg/max time.
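The preprocessor variables and the wider index type listed above fit together roughly as follows. This is an illustrative sketch, not an excerpt from stream.c: the default array size is deliberately small, the helper function names are hypothetical, and ssize_t is taken from the POSIX header sys/types.h.

```c
#include <assert.h>
#include <sys/types.h>   /* ssize_t (POSIX) */

#ifndef STREAM_ARRAY_SIZE
#define STREAM_ARRAY_SIZE 1000000   /* small value chosen for this sketch */
#endif

#ifndef STREAM_TYPE
#define STREAM_TYPE double          /* override with -DSTREAM_TYPE=float */
#endif

static STREAM_TYPE a[STREAM_ARRAY_SIZE];
static STREAM_TYPE b[STREAM_ARRAY_SIZE];
static STREAM_TYPE c[STREAM_ARRAY_SIZE];

/* Fill the arrays with known values so the result can be checked. */
void init_arrays(void)
{
    for (ssize_t j = 0; j < STREAM_ARRAY_SIZE; j++) {
        a[j] = 0.0;
        b[j] = 2.0;
        c[j] = 1.0;
    }
}

/* Triad kernel; ssize_t lets the index exceed the range of a 32-bit int
 * on 64-bit systems. */
void triad(STREAM_TYPE scalar)
{
    for (ssize_t j = 0; j < STREAM_ARRAY_SIZE; j++)
        a[j] = b[j] + scalar * c[j];
}

/* Sum of the result array, accumulated in double. */
double triad_checksum(void)
{
    double sum = 0.0;
    for (ssize_t j = 0; j < STREAM_ARRAY_SIZE; j++)
        sum += (double)a[j];
    return sum;
}

/* Size of one array in MiB (2^20 bytes). */
double array_mib(void)
{
    assert(STREAM_ARRAY_SIZE > 0);
    return (double)sizeof(STREAM_TYPE) * (double)STREAM_ARRAY_SIZE
           / (1024.0 * 1024.0);
}
```

Both settings can be changed at compile time without editing the source, for example with -DSTREAM_ARRAY_SIZE=80000000 -DSTREAM_TYPE=float on the compile line.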
2013-01-17: Many older results finally added to tables!
Dr. Bandwidth apparently lost track of many STREAM benchmark submissions in
his e-mail. Today many of these older results have been added to the tables.
The results have been added to the tables according to the original submission
dates, so here is a list of the newly added entries:
-- originally submitted on 2009-02-19
-- originally submitted on 2009-09-19
-- originally submitted on 2009-05-11
-- originally submitted on 2009-10-07. This is an "experimental" result with the "HT Assist" (aka
"probe filters") disabled via the BIOS.
- VMC_1200 -- originally
submitted on 2010-11-16. This is a "Virtualization Appliance" from the Virtual Machine Company, Ltd.
I had to guess at the CPU frequency based on the AMD "MagnyCours" processors that
were shipping at the time.
-- originally submitted 2011-09-08.
-- originally submitted on 2011-11-29. I had to guess at the CPU frequency -- I picked the middle of the
range of available frequencies for AMD 8000 series quad-core processors.
-- originally submitted 2012-03-08. This was a single-threaded run compiled with gcc and run on a VMWare
-- originally submitted on 2012-12-20
-- originally submitted on 2013-01-03. This is a "large memory" node on the "Stampede" system at the
Texas Advanced Computing Center
-- originally submitted on 2013-01-03. This is a compute node on the "Stampede" system at the
Texas Advanced Computing Center
-- originally submitted on 2013-01-03. This is one of the Intel coprocessors installed in the
"Stampede" system at the Texas Advanced Computing Center.
2010-04-17: New sorted tables added!
2010-04-17: Submission Date added to all rows of all tables
With the constantly changing nomenclature in the market, it is hard for
even me to remember if an entry is an old system or a new one. Now every
row on every table has the submission date as the first field. So if
seeing all the results sorted by date (see item above) is not what you
want, you can still see that critical information in all of the tables.
If you notice systems for which the availability date was significantly
earlier than the STREAM benchmark submission date (e.g., old Cray
systems), please e-mail
any relevant info -- I will be adding availability dates to the database
as time permits.
2009-11-28: Back on line & minor updates
Due to a minor administrative error, the STREAM benchmark web site has not
displayed the updates that I made since August 2009. This is now fixed and
the web site is now correctly displaying the newest updates.
I noticed a minor error in some old results when I added the Fujitsu SPARC
Enterprise M8000 and M9000 results. When I added the 2007 results for those
two systems, I incorrectly listed the CPU frequency as 2150 MHz rather than
the correct value of 2400 MHz. This does not affect the bandwidth numbers,
but it does change the peak MFLOPS and Machine Balance calculations. The
new computations are reflected in the standard and top20 lists.
2008-12-06: New Windows binaries available
An easy-to-use package with versions for 32-bit and 64-bit Windows
and including both single-threaded and OpenMP versions is available:
Let me know if this does not work for you.
2008-12-06: STREAM Variants at Oak Ridge National Labs
The folks at Oak Ridge National Labs have created a fairly large
number of variants of the STREAM benchmark for different data types and
for different programming languages. From the
Extreme Scale Software Center
page, there are currently versions of STREAM in several subdirectories:
2008-07-28: New version of STREAM in Java!
Although it is not yet compliant with the STREAM run rules for
multi-threaded runs, there is now a very clean (FORTRAN-looking)
Java version of the STREAM benchmark in the
What's Not so New Any More!
2007-06-18: New version of STREAM in Pascal!
So how many of you out there are old enough to remember a programming
language called "Pascal"? I recall doing a little bit of programming
in Pascal using the UCSD Pascal "p-code" integrated development
environment on an Apple ][ computer.
A participant in the
project donated a Pascal version of STREAM that I have placed in the
2005-02-07: Re-organization of the STREAM ftp site
The Department of Computer Science at the University of Virginia has
discontinued anonymous ftp service. So I have moved the stuff that
used to only be available by ftp into a new directory that can be
accessed by http here.
2004-11-17: Re-organization of the STREAM ftp site and file names
The most recent "official" versions have been renamed "stream.f" and
"stream.c" -- all other versions have been moved to the "Versions"
The "official" timer (was "second_wall.c") has been renamed "mysecond.c".
This is embedded in the C version ("stream.c"), but still needs to be
externally linked to the FORTRAN version ("stream.f").
2004-11-17: New "Getting Started" section in FAQ
Due to a number of recent requests for help in getting started, I
have renamed the files and added some new "Getting Started"
instructions in the FAQ.
New Opteron Binary
A STREAM binary for Opteron/Athlon64 has been submitted to the
STREAM web site. It is reported to work on SUSE, Red Hat, and
Fedora. Give it a try, and let me know if it works!
Here is the
link to the submission.
2003-09-23: Minor change in reporting for MPI runs
After reviewing several requests, I have decided to allow a subset
of MPI results to be published in the "standard" table. The
specific case that I allow is when the MPI version of the code is
used to run the STREAM benchmark on a single SMP system.
I do not require that all of the MPI processes run under a single O/S,
only that the hardware supports shared memory operation that would
allow a single O/S to be run if so configured. These "MPI on SMP"
results are marked with the notation "(MPI Result)" on the standard
benchmark page, and these results are eligible for inclusion in the
For completeness, these results will also continue
to be included in the MPI table.
Feel free to flame me if you think that this is a bad idea.
e-mail flames here
I have written up my methodology for estimating STREAM Triad bandwidth using
published results from the 171.swim benchmark in SPEC's CPU2000 suite. These
are not publishable STREAM results, but they definitely give a good rough
estimate on a wide variety of systems.
This is a slightly updated version of a presentation that I made at
the TOP500 meeting at SuperComputing2002 in November. The slides
show a methodology for getting reasonably accurate estimates of
SPECfp_rate2000 using only one measurement, plus a set of architectural
parameters of the system.
2002-11-05: A new category of results has been defined!
In response to popular demand, a new category has been created explicitly for "tuned" results.
This category allows source code modifications to increase performance, including assembly language coding.
2002-11-05: The "Top10" list has been extended to the "Top20" list.
"Top20" list is drawn from
both the "standard" and "tuned" results. This provides a "show-off" category to
allow folks to show off the best realizable performance on their
systems, similar to the LINPACK NxN (aka LINPACK HPC) benchmark.
2002-06-25: An MPI version of STREAM has been released.
Check it out.
2002-02-25: An OpenMP version is available in C!
Version 4 of the STREAM benchmark in C is now available with OpenMP directives.
This version still does not have the error-checking or anti-optimization code
that is contained in version 5 (still FORTRAN only), but it does allow shared-memory
parallel execution on many systems!
2000-07-30: Version 5 of the STREAM benchmark is ready!
Version 5 of the STREAM benchmark is available for FORTRAN users.
New features include:
OpenMP directives for parallel execution on many shared-memory machines
Added a new subroutine to validate results
Added modifications to array elements to prevent aggressive optimizers
from moving the timer calls. (This is a problem with IBM's XLF 7.1 when
using prior versions of STREAM.)
Switched from RMS time to AVG time in output. (Min time is still
used to calculate the bandwidths.)
STREAM now excludes the first and last iterations when looking for
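The timing summary described in the last two items might be sketched as follows. The sketch assumes that the first and last iterations are excluded from all three statistics (the note above does not say which ones are affected) and that 1 MB means 10^6 bytes. The struct and function names are hypothetical, and the code is written in C rather than the FORTRAN of version 5.

```c
#include <assert.h>
#include <stddef.h>

struct time_stats { double min, avg, max; };

/* Summarize per-iteration times, ignoring the first and last iterations.
 * ASSUMPTION: the exclusion applies to min, avg and max alike. */
struct time_stats summarize_times(const double *times, size_t ntimes)
{
    assert(ntimes >= 3);
    struct time_stats s = { times[1], 0.0, times[1] };
    for (size_t k = 1; k + 1 < ntimes; k++) {
        if (times[k] < s.min) s.min = times[k];
        if (times[k] > s.max) s.max = times[k];
        s.avg += times[k];
    }
    s.avg /= (double)(ntimes - 2);
    return s;
}

/* Bandwidth in MB/s computed from the minimum time, not the average. */
double bandwidth_mb_per_s(double bytes, double min_seconds)
{
    assert(min_seconds > 0.0);
    return 1.0e-6 * bytes / min_seconds;
}
```

For iteration times of 0.5, 0.25, 0.125, 0.25 and 0.5 seconds, the summary uses only the middle three values, giving a minimum of 0.125 seconds and a maximum of 0.25 seconds. A kernel that moves 16,000,000 bytes is then reported at about 128 MB/s.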
2000-04-18: All e-mail contributions caught up!
I am still working on adding links from the original submissions to the
entries in the tables. All are complete for the submissions in 1996
and 2000, and I am working toward the middle.....
I also switched from 68k Linux to PowerPC Linux for my home machine,
so I have a much more complete set of tools. Look for STREAM listings
for the PowerComputing PowerCurve 601/120.
2000-04-18: FAQ Revised and Updated!
The STREAM FAQ has been revised and updated with
a much more useful discussion of the STREAM benchmark run rules. Check
Really OLD News
1999.11.10: e-mail updates
I have updated the hypermail archives
of the original submissions of STREAM results. Not everything has made
it into the tables yet, but I am slowly catching up.
1999.10.29: New Categories for PC's and Mac's
I have added tables for "standard" results for PC's (and compatibles) and
Macintosh's (and compatibles). "Non-standard/experimental" results
and "32-bit" results are still in the corresponding tables -- these tables
are small enough that you can find the Mac and PC results easily enough.
1999.10.29: Links to Original Data
I am beginning to add links from the data pages back to the original e-mail
submissions (in hypermail format). Unfortunately, this is an entirely
manual process, so it is going to take a while. I also have the problem
that I cannot get the new version of hypermail to compile on my AIX machine,
so it will probably be slow going until I get Linux up on my Mac (68k)
1999.10.18: I'm back!
STREAM's author and maintainer (that would be me)
is back. I just moved from Silicon Valley to Austin, TX, and have had a
bit of trouble getting all my computers and accounts configured.
As of Monday, October 18, 1999, STREAM is back in business, and I will
be happy to accept results and will get them in the tables promptly.
1999.10.18: AMD Athlon (aka K7) Results!!!
The first results from the new AMD Athlon (aka K7) are in, courtesy of
Claus Jeppesen (firstname.lastname@example.org). This machine is a screamer, posting
over 450 MB/s on three of the four kernels! Watch out Intel!
Summer 1999: New Reporting Rules for Systems with Multiple Independent
A Change of Categories for Some Multiprocessor Results: A "partially
depopulated" system is one in which only a subset of the cpus are used
for the benchmark, and for which this subset is spread around the machine
to decrease contention. For example on the SGI Origin2000, each node has
2 cpus sharing a single bus and memory subsystem. The results in the "experimental/nonstandard"
table labelled "1 per node" are based on using only one cpu per node board,
and are considered a "nonstandard" way of using the machine. Similarly,
the Sun Ultra10000 has 4 cpus per node board, so results using 1, 2, or
3 cpus per node also go into this table of "nonstandard" results.
From this point forward, "partially populated" results will no longer
be included in the table of "standard" results -- only "fully populated"
multiprocessor results will be considered "standard". "Partially populated"
results are still welcome, but will be placed in the "experimental/nonstandard"
tables, which have been renamed (from simply "experimental") to more clearly
denote the non-standard use of the hardware.
Summer 1999: Linux Executables for IA32
There are some new contributed sources and binaries in the
The linux executables run great on Pentium II & Pentium III machines.
We are still a bit puzzled by the Windows NT Server results. Any gurus
out there willing to help out?
1999.10.18: Mason Cabot and I figured
out that the poor results under Windows NT were the result of a misconfiguration
on my part. My Linux boxes all had the full complement of SIMM's installed,
while the NT machines had the minimum configuration. Apparently the Intel
chipset used in this box (the SGI 1400) requires a fully loaded memory
system to reach full bandwidth.
Mac PowerPC executable set
A set of Mac executables is available now in Stuff-ed format. Get your
An Old DOS version
Dennis Lee has revised a version of STREAM for DOS machines. The executable
and a README file are available in a zip
Mailing List Archive
I have created a mailing list to discuss future developments of STREAM
and memory bandwidth benchmarking in general. The archives of the mailing
list are here in hypermail format.
Hypermail archives of old e-mail
Thanks to some friendly pointers to the right tools, the five years of
contributions to the STREAM archive are now available for surfing. Go
for whatever details exist on machine details, compilers, compiler options,
New Results in 1996:
Mon Dec 16, 10:58am GMT 1996 --- Dell P6-200 (Natoma chipset)
Tue Dec 10, 3:01pm 1996 --- Apple PowerMac 7500 w/ 150 MHz PPC604e
Mon Dec 9, 3:18pm EST 1996 --- PowerCenter 120 (Mac Clone)
Sat Dec 7, 2:35am 1996 --- HP C180
Tue Nov 26, 10:52am 1996 --- IBM RS/6000-43P
Tue Nov 26, 10:52am 1996 --- IBM RS/6000-25T
Sat Nov 23 1:24am EST 1996 --- Apple PowerMac 7500/100
Fri Nov 22 16:32:58 EST 1996 --- Cray T932, 1-32 cpus
Fri Nov 22 16:32:58 EST 1996 --- Sun Ultra II, 1-2 cpus
Fri Nov 22 16:32:58 EST 1996 --- Compaq Proliant 5000
Fri Nov 22 16:32:58 EST 1996 --- Gateway 2000 P6/200
Fri Nov 22 16:32:58 EST 1996 --- HP/9000-735/125
Fri Nov 22 16:32:58 EST 1996 --- HP/9000-712/80
Mon Oct 7 09:43:46 EDT 1996 --- SGI Origin 2000, 1-32 cpus
Mon Oct 7 09:43:46 EDT 1996 --- SGI Origin 200, 1-2 cpus
Mon Oct 7 09:43:46 EDT 1996 --- Convex Exemplar S, 6-16 cpus
Mon Sep 16 14:56:23 EDT 1996 --- SGI Power Challenge 10000, 1-6 cpus
Sat Jul 6 09:33:15 EDT 1996 --- NEC SX-4, 1-4 cpus
Mon May 6 14:33:00 EDT 1996 --- DEC 4100/300E, 4100/300, 4100/400
Tue Apr 30 18:02:00 EDT 1996 --- DEC 8400-5-350 1-8 cpus
Thu Apr 25 11:16:33 EDT 1996 --- Sun Ultra Enterprise 6000 1-32 cpus ---
assembly coded kernels
Mon Apr 22 12:13:43 EDT 1996 --- Sun Ultra Enterprise 6000 1-32 cpus
Thu Apr 4 10:47:40 EST 1996 --- Asus Pentium 200 and Pentium 180
Mon Apr 1 1996 --- Cray J932 1-32 cpus
Mon Apr 1 1996 --- Tera MTA 1-16 cpus (simulated 300 MHz results)
Thu Mar 28 12:45:42 EST 1996 --- Convex SPP-1600 1-32 cpus
Tue Feb 27 10:49:01 EST 1996 --- Sun Ultra I/170 update
Thu Jan 11 06:02:04 EST 1996 --- HP 9000/819 (K200)
May 12, 1996:
There has been a complete change in the underlying structure of the database.
The entire database of raw results is now contained in a single, comma-delimited
datafile. This is described in the Table subdirectory.
John D. McCalpin email@example.com