What's NEW with the STREAM Benchmark!

Department of Computer Science
School of Engineering and Applied Science
University of Virginia, Charlottesville, Virginia

What's New!

List of the 30 newest entries!

2016-07-28: Bug fix in stream_mpi.c

I don't like MPI and the version of STREAM written in MPI is not intended as a "standard" version, but there is a bug in the timing code that is too big to ignore. I very carefully derived the proper way to bound the timings for MPI runs that would guarantee that no work would be done before the start time and no work would be done after the end time --- then what I actually implemented in the code was something different (and provably incorrect). Argh!

The new revision of stream_mpi.c has the RCS version number of 1.8 and a modification date of 2016-07-28.
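The kind of bound described in this entry can be illustrated with a small sketch. It is not taken from stream_mpi.c; it only assumes that each rank reads its own clock before entering a barrier that precedes the kernel and again after leaving a barrier that follows it (in an MPI code these would typically be MPI_Wtime and MPI_Barrier calls). Under that assumption no rank can start work before any rank's first reading or finish after any rank's second reading, so every per-rank interval is an upper bound on the true elapsed time, and the smallest of them is the tightest such bound. The function name and argument layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not copied from stream_mpi.c.
 *
 * before_barrier[r]: rank r's local clock, read BEFORE it enters the
 *                    barrier that precedes the kernel.
 * after_barrier[r]:  rank r's local clock, read AFTER it leaves the
 *                    barrier that follows the kernel.
 *
 * Each difference uses a single rank's clock, so clock offsets between
 * ranks cancel.  Every difference is at or above the true elapsed time;
 * the minimum is the smallest value that still has that guarantee.
 */
double bounded_elapsed(const double *before_barrier,
                       const double *after_barrier,
                       size_t nranks)
{
    assert(nranks > 0);

    double best = after_barrier[0] - before_barrier[0];
    for (size_t r = 1; r < nranks; r++) {
        double interval = after_barrier[r] - before_barrier[r];
        if (interval < best)
            best = interval;
    }
    return best;
}
```

Taking the maximum instead of the minimum would also be a valid upper bound, just a looser one.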

2016-03-28: A bit of clean-up...

Some broken links on the main page were reported to me and I decided to clean up a few things.

The results in the "Macintosh-compatible" category are for obsolete Motorola 68000 and PowerPC based systems (the newest results now 10 years old), so I renamed the category as "Obsolete Macintosh-compatible" results and moved it to the bottom of the page. This was just a subset of the "standard" results anyway, so nothing has been lost.
The "PC-compatible" results category is also mostly obsolete, and I added a note to that effect (but did not move anything).

2016-01-26: Some older and some newer results....

ScaleMP has submitted new results on the 64-node system they have installed at the University of Queensland. This shared-memory system based on 2-socket Xeon E5-2680v3 nodes delivers over 6 TB/s on the Add and Triad kernels, and reaches the number 3 position on the Top20 list.

Near the opposite end of the spectrum, I realized that I had not published results from the HP Moonshot M300 system that I tested in June 2014. This system uses the Intel C2750 processor (an 8-core Atom-based processor) and delivers up to 15 GB/s.

Even closer to the opposite end of the spectrum, results from the Intel Edison board are included. This board contains a dual-core Atom (Silvermont) processor running at 500 MHz and delivers a single-threaded bandwidth of almost 2 GB/s out of what appears to be a 3.2 GB/s peak (based on my interpretation of the Intel Edison Product Brief (pdf)).

2015-07-28: New results, including a new TOP20 leader

SGI has submitted a new record result on the SGI UV3000 system delivering 12.8 TB/s to 13.8 TB/s on the four STREAM kernels. This is a shared-memory system with 256 Intel Xeon E5-4650 v3 (12 core) processors (3072 cores total). All of the cores were activated and available to the OS, but only 1/2 of the cores (1536) were used to run the benchmark. (It is common for multicore systems to slow down slightly when using all cores. At extremely large system scales this effect is likely to get worse due to background OS activity.)

At the other end of the performance spectrum, results from the Raspberry Pi 2 Model B are now published. Both the Fortran and C results show relatively strong results for the STREAM Copy kernel (1700 to 1800 MB/s -- I think that the peak memory bandwidth of the system is 3200 MB/s, but it is hard to get details on the precise frequency used for the LPDDR2 DRAM in the system). The kernels with arithmetic are much slower -- 500 MB/s to 950 MB/s. Presumably this is due to poor support for double-precision floating-point arithmetic in the ARM processor core used, but again it is very hard to find this level of detail on any ARM processor, and the details on the Broadcom BCM2836 chip used in the Raspberry Pi 2 are particularly difficult to find.

I also finally published some results from the TACC Lonestar 4 system (based on the Intel Xeon X5680 ("Westmere EP") processors). Lonestar 4 is about to be replaced by a new system, but I wanted these results to be in the archive. The results are listed under the Dell PowerEdge M610 model name.

Another set of (not-quite-so-) old results is from a single-socket Intel Xeon E3-1270 (Sandy Bridge) system. Please note that these results used only 1 of the 4 cores on the system, as this gave slightly better performance. There is nothing wrong with using fewer cores, but it does mean that the "STREAM Balance" calculation is biased since it compares the peak performance of only 1 core against the memory bandwidth. If all the cores were used, the "STREAM Balance" would be a higher number (more unbalanced) by a factor of slightly more than 4 -- a factor of 4 from counting the peak FLOPS of all the cores and a small additional increase due to the reduction in sustained bandwidth from the extra contention caused by using all four cores.
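The arithmetic behind that "slightly more than 4" can be sketched as follows. This is a minimal illustration, assuming the balance is computed as peak FLOPS divided by sustained memory bandwidth expressed in 8-byte words per second. The function name is hypothetical, and the numbers in the usage note are made up for illustration; they are not the E3-1270 results.

```c
#include <assert.h>

/*
 * Hypothetical helper: balance = peak FLOPS per sustained memory word.
 * peak_gflops:   peak floating-point rate of the cores being counted
 * bandwidth_gbs: sustained STREAM bandwidth in GB/s
 * word_bytes:    size of one data word (8 for double precision)
 * A larger result means a more unbalanced machine.
 */
double stream_balance(double peak_gflops, double bandwidth_gbs,
                      double word_bytes)
{
    assert(bandwidth_gbs > 0.0 && word_bytes > 0.0);
    double gwords_per_s = bandwidth_gbs / word_bytes;
    return peak_gflops / gwords_per_s;
}
```

With illustrative values of 25 GFLOPS and 20 GB/s for one core, stream_balance(25.0, 20.0, 8.0) gives 10. Counting four cores at 100 GFLOPS with bandwidth reduced to 19 GB/s by contention gives about 42.1, a ratio a little above 4.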

These last two sets of results are now tagged with "_nta" or "_alloc" after the machine name. I am using this as a temporary measure to indicate whether "streaming" (or "nontemporal" or "cache-bypassing") stores were used in the run (marked by "_nta"). This is discussed in the STREAM FAQ notes on Counting Bytes and FLOPS.
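Why the distinction matters can be shown with a small byte-counting sketch. STREAM credits Copy and Scale with two words per iteration (one read, one write) and Add and Triad with three (two reads, one write). The traffic model below is a simplifying assumption: with ordinary write-allocate stores, the target line is read before it is written, adding one word of traffic per iteration that STREAM does not count. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum kernel { COPY, SCALE, ADD, TRIAD };

/* Bytes per loop iteration that STREAM credits to each kernel. */
size_t counted_bytes(enum kernel k, size_t word_bytes)
{
    static const size_t words[] = { 2, 2, 3, 3 };
    assert((int)k >= COPY && (int)k <= TRIAD);
    return words[k] * word_bytes;
}

/*
 * Simplified model of bytes actually moved per iteration.
 * write_allocate != 0: the store target is read before being written
 *                      (a run that would be tagged "_alloc").
 * write_allocate == 0: streaming stores bypass the cache, so traffic
 *                      matches the counted bytes (a "_nta" run).
 */
size_t traffic_bytes(enum kernel k, size_t word_bytes, int write_allocate)
{
    size_t extra = write_allocate ? word_bytes : 0;
    return counted_bytes(k, word_bytes) + extra;
}
```

For double-precision Triad this gives 24 counted bytes per iteration against 32 bytes of modeled traffic with write-allocate, so under this model an "_alloc" run reports at most 3/4 of the bandwidth the memory system is actually delivering.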

Cleaning up my mailbox, I found a result that I had missed from October 2013. It is an "experimental" result on an overclocked Windows system with an Intel Core i7-3820 processor -- normally 3.6 GHz, but overclocked to 5 GHz. The memory was also overclocked from DDR3/1333 to DDR3/2000. Results are in the screenshots attached to the submission.

2014-10-28: New Variants of STREAM released

Two variants of STREAM have been added to the Versions subdirectory of the source code tree.

stream_mpi.c: a new port of STREAM version 5.10 to MPI.
stream_5-10_posix_memalign.c: a variant of version 5.10 using dynamic array allocation.

Both of these versions are fully compliant versions of STREAM (unless there are bugs).

Important notes:

For shared memory systems, the standard OpenMP version of STREAM is strongly preferred to the MPI version.
Both of the new versions use dynamic memory allocation, which can change the ability of the compiler to optimize the code. See the READ.ME file in the Versions subdirectory for more notes and for a sample compile line for the MPI version of the code.
The timing code for the MPI version has been carefully designed to deliver the smallest time that is guaranteed to be at or above the true elapsed time from the earliest start of execution of any rank to the latest end of execution of any rank.
There is an important change in the way array size is defined in the MPI version:
- The older Fortran version of STREAM in MPI replicates the arrays on each MPI rank.
- The new C version of STREAM in MPI distributes the arrays across the MPI ranks.
Although it is relatively easy to get confused about sizing with either method, I have found that with the old method it is easier to accidentally request too much memory (which can crash some systems), while mistakes with the new method typically request too little memory (which may not produce valid results, but does not risk crashing anything). Your typical mistakes may differ from mine. Caveat Emptor.
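The difference between the two sizing conventions can be sketched with a little arithmetic. The helpers below are hypothetical and only assume three arrays of equal length; how the distributed version handles a size that does not divide evenly across ranks is an assumption here (rounded up), not something taken from the code.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_ARRAYS 3  /* STREAM works on three arrays */

/* Old convention: every rank allocates its own full-size arrays. */
uint64_t total_bytes_replicated(uint64_t n_per_rank, uint64_t nranks,
                                uint64_t elem_bytes)
{
    return NUM_ARRAYS * n_per_rank * nranks * elem_bytes;
}

/* New convention: the size is the global total, split across ranks. */
uint64_t total_bytes_distributed(uint64_t n_total, uint64_t elem_bytes)
{
    return NUM_ARRAYS * n_total * elem_bytes;
}

/* Per-rank share under the new convention (remainder rounded up). */
uint64_t per_rank_bytes_distributed(uint64_t n_total, uint64_t nranks,
                                    uint64_t elem_bytes)
{
    assert(nranks > 0);
    uint64_t n_local = (n_total + nranks - 1) / nranks;
    return NUM_ARRAYS * n_local * elem_bytes;
}
```

For example, a size of 80,000,000 doubles on 64 ranks means about 123 GB in total under the replicated convention but only about 1.9 GB in total (30 MB per rank) under the distributed one, which is how the same number can be too large for one version and too small for the other.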

2014-01-28: Some fixes to the database

As I was reworking some analyses, I discovered some errors in the STREAM benchmark database that I corrected:

Incorrect frequency for some IBM POWER5 systems (was 1800 MHz but should have been 1900 MHz).
Incorrect frequency for the HP_AlphaServer_DS15 (was 1 MHz but should have been 1000 MHz).
Incorrect system name for the ScaleMP_Xeon5650_64B entry.
Incorrect name and results for SuperMicro_X8DTN+_Xeon5690 system (I copied the database entry from a similar system, but then forgot to overwrite the values with the new ones for this submission). In addition, the attachment containing the submitted values did not get picked up by the hypermail program, so the results did not show up in the link to the submission, either. I added the text to the web page for the original submission manually.
I changed the frequency of the Opteron processors in the Sun_X4640 system from 2.2 GHz to 2.4 GHz. The system under test used 6-core AMD "Istanbul" processors and AMD did not make a 2.2 GHz part. The documentation that I was able to find only listed 2.4 and 2.6 GHz processors for this system.

2013-01-17: STREAM version 5.10 released!

STREAM version 5.10 has finally been released -- at least in the C language. Fortran will follow when I get around to it....

While this version does not change what STREAM measures, it does provide a number of long-awaited features:

Updated Validation Code: the revised version does not suffer from the accumulation of roundoff error for large arrays. Compiling with the "VERBOSE" preprocessor flag causes the code to print out the computed error even if the validation passes, and causes the code to print the values and locations of the first 10 locations in each of the arrays that have errors exceeding the error tolerance.
Updated array index variables from "int" to "ssize_t" to allow 64-bit systems to use array indices that are too large to fit in a 32-bit integer data type.
Changed the preprocessor variable defining the array size from "N" to "STREAM_ARRAY_SIZE" to make it easier to find in the code.
Added a preprocessor variable "STREAM_TYPE" that allows the user to override the default data type from "double" to "float". This could be used to change to non-floating-point data types as well, but some printf formats would have to be changed to match.
Updated comments in the source code on how to configure/compile/run the STREAM benchmark.
Many small changes in the printed output to account for computers getting bigger and faster:
- Array sizes now printed in "GiB" as well as "MiB".
- Output format now prints fewer decimal places for the bandwidth but more decimal places for the min/avg/max time.
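A minimal sketch of how these pieces fit together is shown below. It is not an excerpt from stream.c: the default array size is an arbitrary small value chosen for illustration, the initial values and function names are hypothetical, and the error check is only in the spirit of the updated validation (it averages per-element errors against an expected value rather than comparing large accumulated sums).

```c
#include <assert.h>
#include <sys/types.h>  /* ssize_t (POSIX) */

/* Both settings can be overridden at compile time. */
#ifndef STREAM_ARRAY_SIZE
#define STREAM_ARRAY_SIZE 1000000  /* illustrative default only */
#endif

#ifndef STREAM_TYPE
#define STREAM_TYPE double
#endif

static STREAM_TYPE a[STREAM_ARRAY_SIZE];
static STREAM_TYPE b[STREAM_ARRAY_SIZE];
static STREAM_TYPE c[STREAM_ARRAY_SIZE];

/* Fill the arrays with known values (hypothetical choices). */
void init_arrays(void)
{
    ssize_t j;  /* wide enough for more than 2^31 elements on 64-bit systems */
    for (j = 0; j < STREAM_ARRAY_SIZE; j++) {
        a[j] = 1.0;
        b[j] = 2.0;
        c[j] = 0.5;
    }
}

/* The Triad kernel: a = b + scalar * c. */
void triad(STREAM_TYPE scalar)
{
    ssize_t j;
    for (j = 0; j < STREAM_ARRAY_SIZE; j++)
        a[j] = b[j] + scalar * c[j];
}

/* Average absolute per-element error of array a against an expected value. */
double avg_abs_error_a(STREAM_TYPE expected)
{
    double sum = 0.0;
    ssize_t j;
    for (j = 0; j < STREAM_ARRAY_SIZE; j++) {
        double d = (double)a[j] - (double)expected;
        sum += (d < 0.0) ? -d : d;
    }
    return sum / STREAM_ARRAY_SIZE;
}
```

On compilers that accept -D, something like -DSTREAM_ARRAY_SIZE=80000000 or -DSTREAM_TYPE=float would override the defaults without editing the source. After init_arrays() and triad(3.0), every element of a should be 3.5, so avg_abs_error_a(3.5) is zero.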

2013-01-17: Many older results finally added to tables!

Dr. Bandwidth apparently lost track of many STREAM benchmark submissions in his e-mail. Today many of these older results have been added to the tables. The results have been added to the tables according to the original submission dates, so here is a list of the newly added entries:

Generic_Opteron_2352 -- originally submitted on 2009-02-19
Generic_Core2Duo_E6750 -- originally submitted on 2009-09-19
SuperMicro_X8DTN_Xeon5580 -- originally submitted on 2009-05-11
SuperMicro_BHQME_Opteron8435_Exp -- originally submitted on 2009-10-07. This is an "experimental" result with the "HT Assist" (aka "probe filters") disabled via the BIOS.
VMC_1200 -- originally submitted on 2010-11-16. This is a "Virtualization Appliance" from the Virtual Machine Company, Ltd. I had to guess at the CPU frequency based on the AMD "MagnyCours" processors that were shipping at the time.
SuperMicro_X8DTN+_Xeon5590 -- originally submitted 2011-09-08.
Sun_X6460 -- originally submitted on 2011-11-29. I had to guess at the CPU frequency -- I picked the middle of the range of available frequencies for AMD 8000 series quad-core processors.
SuperMicro_X8DAH_XeonE5645 -- originally submitted 2012-03-08. This was a single-threaded run compiled with gcc and run on a VMWare virtual machine.
Raspberry_Pi -- originally submitted on 2012-12-20
Dell_PowerEdge_820 -- originally submitted on 2013-01-03. This is a "large memory" node on the "Stampede" system at the Texas Advanced Computing Center
Dell_DCS8000 -- originally submitted on 2013-01-03. This is a compute node on the "Stampede" system at the Texas Advanced Computing Center
Intel_XeonPhi_SE10P -- originally submitted on 2013-01-03. This is one of the Intel coprocessors installed in the "Stampede" system at the Texas Advanced Computing Center.

2010-04-17: New sorted tables added!

2010-04-17: Submission Date added to all rows of all tables

With the constantly changing nomenclature in the market, it is hard for even me to remember if an entry is an old system or a new one. Now every row on every table has the submission date as the first field. So if seeing all the results sorted by date (see item above) is not what you want, you can still see that critical information in all of the tables.

If you notice systems for which the availability date was significantly earlier than the STREAM benchmark submission date (e.g., old Cray systems), please e-mail any relevant info -- I will be adding availability dates to the database as time permits.

2009-11-28: Back on line & minor updates

Due to a minor administrative error, the STREAM benchmark web site has not displayed the updates that I made since August 2009. This is now fixed and the web site is now correctly displaying the newest updates.
I noticed a minor error in some old results when I added the Fujitsu SPARC Enterprise M8000 and M9000 results. When I added the 2007 results for those two systems, I incorrectly listed the CPU frequency as 2150 MHz rather than the correct value of 2400 MHz. This does not affect the bandwidth numbers, but it does change the peak MFLOPS and Machine Balance calculations. The new computations are reflected in the standard and top20 lists.

2008-12-06: New Windows binaries available

An easy-to-use package with versions for 32-bit and 64-bit Windows and including both single-threaded and OpenMP versions is available: StreamWin-32-64_distro.zip. Let me know if this does not work for you.

2008-12-06: STREAM Variants at Oak Ridge National Labs

The folks at Oak Ridge National Labs have created a fairly large number of variants of the STREAM benchmark for different data types and for different programming languages. From the Extreme Scale Software Center page, there are currently versions of STREAM in several subdirectories:

Chapel
Fortress
X10
C

2008-07-28: New version of STREAM in Java!

Although it is not yet compliant with the STREAM run rules for multi-threaded runs, there is now a very clean (FORTRAN-looking) Java version of the STREAM benchmark in the Contrib area.

What's Not so New Any More!

2007-06-18: New version of STREAM in Pascal!

So how many of you out there are old enough to remember a programming language called "Pascal"? I recall doing a little bit of programming in Pascal using the UCSD Pascal "p-code" integrated development environment on an Apple ][ computer.
A participant in the Free Pascal project donated a Pascal version of STREAM that I have placed in the Contrib area.

2005-02-07: Re-organization of the STREAM ftp site

The Department of Computer Science at the University of Virginia has discontinued anonymous ftp service. So I have moved the stuff that used to only be available by ftp into a new directory that can be accessed by http here.

2004-11-17: Re-organization of the STREAM ftp site and file names

The most recent "official" versions have been renamed "stream.f" and "stream.c" -- all other versions have been moved to the "Versions" subdirectory.
The "official" timer (was "second_wall.c") has been renamed "mysecond.c". This is embedded in the C version ("stream.c"), but still needs to be externally linked to the FORTRAN version ("stream.f").

2004-11-17: New "Getting Started" section in FAQ

Due to a number of recent requests for help in getting started, I have renamed the files and added some new "Getting Started" instructions in the FAQ.

New Opteron Binary

A STREAM binary for Opteron/Athlon64 has been submitted to the STREAM web site. It is reported to work on SUSE, Red Hat, and Fedora. Give it a try, and let me know if it works!
Here is the link to the submission.

2003-09-23: Minor change in reporting for MPI runs

After reviewing several requests, I have decided to allow a subset of MPI results to be published in the "standard" table. The specific case that I allow is when the MPI version of the code is used to run the STREAM benchmark on a single SMP system.

I do not require that all of the MPI processes run under a single O/S, only that the hardware supports shared memory operation that would allow a single O/S to be run if so configured. These "MPI on SMP" results are marked with the notation "(MPI Result)" on the standard benchmark page, and these results are eligible for inclusion in the "TOP20" list.

For completeness, these results will also continue to be included in the MPI table.

Feel free to flame me if you think that this is a bad idea. e-mail flames here

2002-12-04: Estimating Sustainable Bandwidth using SPEC CPU2000 Results

I have written up my methodology for estimating STREAM Triad bandwidth using published results from the 171.swim benchmark in SPEC's CPU2000 suite. These are not publishable STREAM results, but they definitely give a good rough estimate on a wide variety of systems.

2002-12-04: Estimating SPECfp_rate2000 using Peak GFLOPs and STREAM Triad (PowerPoint presentation)

This is a slightly updated version of a presentation that I made at the TOP500 meeting at SuperComputing2002 in November. The slides show a methodology for getting reasonably accurate estimates of SPECfp_rate2000 using only one measurement, plus a set of architectural parameters of the system.

2002-11-05: A new category of results has been defined!

In response to popular demand, a new category has been created explicitly for "tuned" results. This category allows source code modifications to increase performance, including assembly language coding.

2002-11-05: The "Top10" list has been extended to the "Top20" list.

The "Top20" list is drawn from both the "standard" and "tuned" results. This provides a "show-off" category to allow folks to show off the best realizable performance on their systems, similar to the LINPACK NxN (aka LINPACK HPC) benchmark.

2002-06-25: An MPI version of STREAM has been released.

Check it out.

2002-02-25: An OpenMP version is available in C!

Version 4 of the STREAM benchmark in C is now available with OpenMP directives. This version still does not have the error-checking or anti-optimization code that is contained in version 5 (still FORTRAN only), but it does allow shared-memory parallel execution on many systems!
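As a rough illustration of what "with OpenMP directives" means for a kernel like Triad, here is a minimal sketch. It is not the version 4 source; the function name and signature are hypothetical, and it only uses the standard "parallel for" directive.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical Triad kernel with an OpenMP work-sharing directive.
 * When OpenMP is enabled at compile time, the loop iterations are
 * divided among the threads; otherwise the pragma is ignored and the
 * loop runs serially with the same result.
 */
void triad_omp(double *a, const double *b, const double *c,
               double scalar, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);

    long j;
#pragma omp parallel for
    for (j = 0; j < (long)n; j++)
        a[j] = b[j] + scalar * c[j];
}
```

The directive only takes effect when the compiler's OpenMP option is turned on (for example, -fopenmp with GCC).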

2000-07-30: Version 5 of the STREAM benchmark is ready!

Version 5 of the STREAM benchmark is available for FORTRAN users.

New features include:

OpenMP directives for parallel execution on many shared-memory machines
Added a new subroutine to validate results
Added modifications to array elements to prevent aggressive optimizers from moving the timer calls. (This is a problem with IBM's XLF 7.1 when using prior versions of STREAM.)
Switched from RMS time to AVG time in output. (Min time is still used to calculate the bandwidths.)
STREAM now excludes the first and last iterations when looking for minimum times.
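The timing summary described in the last two items can be sketched in C as follows. This is a hedged illustration rather than the benchmark's own code (version 5 itself is Fortran): the struct and function names are hypothetical, and applying the first/last exclusion to the average and maximum as well as the minimum is an assumption made here for simplicity.

```c
#include <assert.h>
#include <stddef.h>

struct timing_summary {
    double min;  /* used for the reported bandwidth */
    double avg;
    double max;
};

/*
 * Summarize per-iteration times, skipping the first and last entries.
 * times[k] is the elapsed time in seconds of iteration k.
 */
struct timing_summary summarize(const double *times, size_t ntimes)
{
    assert(ntimes >= 3);  /* need at least one iteration left over */

    struct timing_summary s = { times[1], 0.0, times[1] };
    for (size_t k = 1; k + 1 < ntimes; k++) {
        if (times[k] < s.min)
            s.min = times[k];
        if (times[k] > s.max)
            s.max = times[k];
        s.avg += times[k];
    }
    s.avg /= (double)(ntimes - 2);
    return s;
}

/* Bandwidth in MB/s (10^6 bytes per second) from the minimum time. */
double bandwidth_mbs(double bytes_moved, double min_time_s)
{
    assert(min_time_s > 0.0);
    return 1.0e-6 * bytes_moved / min_time_s;
}
```

For times of 0.9, 0.5, 0.25, 0.75 and 0.1 seconds, the first and last entries are ignored, so the minimum is 0.25 s even though the last iteration was faster; 24,000,000 bytes moved in that time corresponds to 96 MB/s.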

2000-04-18: All e-mail contributions caught up!

I am still working on adding links from the original submissions to the entries in the tables. All are complete for the submissions in 1996 and 2000, and I am working toward the middle.....

I also switched from 68k Linux to PowerPC Linux for my home machine, so I have a much more complete set of tools. Look for STREAM listings for the PowerComputing PowerCurve 601/120.

2000-04-18: FAQ Revised and Updated!

The STREAM FAQ has been revised and updated with a much more useful discussion of the STREAM benchmark run rules. Check it out!

Really OLD News

1999.11.10: e-mail updates

I have updated the hypermail archives of the original submissions of STREAM results. Not everything has made it into the tables yet, but I am slowly catching up.

1999.10.29: New Categories for PC's and Mac's

I have added tables for "standard" results for PC's (and compatibles) and Macintosh's (and compatibles). "Non-standard/experimental" results and "32-bit" results are still in the corresponding tables -- these tables are small enough that you can find the Mac and PC results easily enough.

1999.10.29: Links to Original Data

I am beginning to add links from the data pages back to the original e-mail submissions (in hypermail format). Unfortunately, this is an entirely manual process, so it is going to take a while. I also have the problem that I cannot get the new version of hypermail to compile on my AIX machine, so it will probably be slow going until I get Linux up on my Mac (68k) at home.

1999.10.18: I'm back!

STREAM's author and maintainer (that would be me) is back. I just moved from Silicon Valley to Austin, TX, and have had a bit of trouble getting all my computers and accounts configured.
As of Monday, October 18, 1999, STREAM is back in business, and I will be happy to accept results and will get them in the tables promptly.

1999.10.18: AMD Athlon (aka K7) Results!!!

The first results from the new AMD Athlon (aka K7) are in, courtesy of Claus Jeppesen (jeppesen@mrl.ucsb.edu). This machine is a screamer, posting over 450 MB/s on three of the four kernels! Watch out Intel!

Summer 1999: New Reporting Rules for Systems with Multiple Independent Memory Subsystems!!!!

A Change of Categories for Some Multiprocessor Results: A "partially depopulated" system is one in which only a subset of the cpus are used for the benchmark, and for which this subset is spread around the machine to decrease contention. For example on the SGI Origin2000, each node has 2 cpus sharing a single bus and memory subsystem. The results in the "experimental/nonstandard" table labelled "1 per node" are based on using only one cpu per node board, and are considered a "nonstandard" way of using the machine. Similarly, the Sun Ultra10000 has 4 cpus per node board, so results using 1, 2, or 3 cpus per node also go into this table of "nonstandard" results.

From this point forward, "partially populated" results will no longer be included in the table of "standard" results -- only "fully populated" multiprocessor results will be considered "standard". "Partially populated" results are still welcome, but will be placed in the "experimental/nonstandard" tables, which have been renamed (from simply "experimental") to more clearly denote the non-standard use of the hardware.

Summer 1999: Linux Executables for IA32

There are some new contributed sources and binaries in the Intel directory.

The Linux executables run great on Pentium II & Pentium III machines. We are still a bit puzzled by the Windows NT Server results. Any gurus out there willing to help out?

1999.10.18: Mason Cabot and I figured out that the poor results under Windows NT were the result of a misconfiguration on my part. My Linux boxes all had the full complement of SIMM's installed, while the NT machines had the minimum configuration. Apparently the Intel chipset used in this box (the SGI 1400) requires a fully loaded memory system to reach full bandwidth.

Mac PowerPC executable set

A set of Mac executables is available now in Stuff-ed format. Get your own copy.

An Old DOS version

Dennis Lee has revised a version of STREAM for DOS machines. The executable and a README file are available in a zip file.

Mailing List Archive

I have created a mailing list to discuss future developments of STREAM and memory bandwidth benchmarking in general. The archives of the mailing list are here in hypermail format.

Hypermail archives of old e-mail

Thanks to some friendly pointers to the right tools, the five years of contributions to the STREAM archive are now available for surfing. Go here for whatever details exist on machines, compilers, compiler options, etc....

New Results in 1996:

Mon Dec 16, 10:58am GMT 1996 --- Dell P6-200 (Natoma chipset)
Tue Dec 10, 3:01pm 1996 --- Apple PowerMac 7500 w/ 150 MHz PPC604e
Mon Dec 9, 3:18pm EST 1996 --- PowerCenter 120 (Mac Clone)
Sat Dec 7, 2:35am 1996 --- HP C180
Tue Nov 26, 10:52am 1996 --- IBM RS/6000-43P
Tue Nov 26, 10:52am 1996 --- IBM RS/6000-25T
Sat Nov 23 1:24am EST 1996 --- Apple PowerMac 7500/100
Fri Nov 22 16:32:58 EST 1996 --- Cray T932, 1-32 cpus
Fri Nov 22 16:32:58 EST 1996 --- Sun Ultra II, 1-2 cpus
Fri Nov 22 16:32:58 EST 1996 --- Compaq Proliant 5000
Fri Nov 22 16:32:58 EST 1996 --- Gateway 2000 P6/200
Fri Nov 22 16:32:58 EST 1996 --- HP/9000-735/125
Fri Nov 22 16:32:58 EST 1996 --- HP/9000-712/80
Mon Oct 7 09:43:46 EDT 1996 --- SGI Origin 2000, 1-32 cpus
Mon Oct 7 09:43:46 EDT 1996 --- SGI Origin 200, 1-2 cpus
Mon Oct 7 09:43:46 EDT 1996 --- Convex Exemplar S, 6-16 cpus
Mon Sep 16 14:56:23 EDT 1996 --- SGI Power Challenge 10000, 1-6 cpus
Sat Jul 6 09:33:15 EDT 1996 --- NEC SX-4, 1-4 cpus
Mon May 6 14:33:00 EDT 1996 --- DEC 4100/300E, 4100/300, 4100/400
Tue Apr 30 18:02:00 EDT 1996 --- DEC 8400-5-350 1-8 cpus
Thu Apr 25 11:16:33 EDT 1996 --- Sun Ultra Enterprise 6000 1-32 cpus --- assembly coded kernels
Mon Apr 22 12:13:43 EDT 1996 --- Sun Ultra Enterprise 6000 1-32 cpus
Thu Apr 4 10:47:40 EST 1996 --- Asus Pentium 200 and Pentium 180
Mon Apr 1 1996 --- Cray J932 1-32 cpus
Mon Apr 1 1996 --- Tera MTA 1-16 cpus (simulated 300 MHz results)
Thu Mar 28 12:45:42 EST 1996 --- Convex SPP-1600 1-32 cpus
Tue Feb 27 10:49:01 EST 1996 --- Sun Ultra I/170 update
Thu Jan 11 06:02:04 EST 1996 --- HP 9000/819 (K200)

May 12, 1996:

There has been a complete change in the underlying structure of the database. The entire database of raw results is now contained in a single, comma-delimited datafile. This is described in the Table subdirectory.

John D. McCalpin john@mccalpin.com