Re: fast bcopy on memory bus?

From: Cliff Click (cliffc@risc.sps.mot.com)
Date: Tue Nov 26 1996 - 10:52:33 CST


John McCalpin wrote:
>
> In article <CLIFFC.96Nov21143249@ami.sps.mot.com> you write:
> >
> >The PPC chips have had this for a while as a user-level instruction.
> >The memcpy/bcopy library calls do this. It's available as
> >"__dcbz()" to the C programmer. Really handy on large streaming
> >codes.
>
> Hi Cliff,
>
> So are you guys ready to send me some STREAM benchmark numbers
> for 604-based machines? I have had no luck at all in that arena --
> all I have is two Power Mac numbers (pretty dismal) and one IBM
> RS/6000-250 result. The Motorola machines ought to look better
> than these?
>
> http://www.cs.virginia.edu/stream/

Hi. "bandwidth for the masses" has been a mantra around here for a
while, but it takes time to turn that into cheap motherboards. So,
no, I don't have any exciting new numbers for you.

Instead, I have numbers for a series of older boxes, with and without
hand-hacking to use dcbz. The hacking took about an hour all told: adding
cache-aligning pre-loops and cleanup loops, unrolling, prefetching, and
inserting dcbz. All work was done at the C level and compiled with -O. You
can use these numbers as evidence that dcbz and prefetch work.
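
As a rough illustration of the kind of hand-hacking described above, here
is a minimal C sketch of a Copy kernel with an aligning pre-loop, a 4x
unrolled main loop, and a cleanup loop. It is not the code used for these
runs. The names copy_dcbz_sketch, ZERO_LINE, PREFETCH, and USE_DCBZ are
hypothetical, and the 32-byte line size is an assumption that must match
the block size dcbz really zeroes on the target. Unless USE_DCBZ is defined
on a PowerPC build, ZERO_LINE compiles to nothing, so the sketch stays
portable.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES   32  /* ASSUMPTION: cache-line (dcbz block) size in bytes */
#define LINE_DOUBLES 4   /* LINE_BYTES / 8-byte doubles */

/* Hypothetical hook: define USE_DCBZ only on a PowerPC target whose dcbz
   zeroes exactly LINE_BYTES bytes. Otherwise this is a portable no-op. */
#if defined(USE_DCBZ) && defined(__GNUC__) && defined(__powerpc__)
#define ZERO_LINE(p) __asm__ volatile("dcbz 0,%0" : : "r"(p) : "memory")
#else
#define ZERO_LINE(p) ((void)(p))
#endif

/* Hypothetical hook: a read-prefetch hint where the compiler offers one. */
#if defined(__GNUC__)
#define PREFETCH(p) __builtin_prefetch((p), 0, 0)
#else
#define PREFETCH(p) ((void)(p))
#endif

void copy_dcbz_sketch(double *dst, const double *src, size_t n)
{
    size_t i = 0;

    /* Pre-loop: copy single elements until dst sits on a line boundary. */
    while (i < n && ((uintptr_t)(dst + i) % LINE_BYTES) != 0) {
        dst[i] = src[i];
        i++;
    }

    /* Main loop: one whole destination line per iteration, unrolled 4x. */
    for (; i + LINE_DOUBLES <= n; i += LINE_DOUBLES) {
        if (i + 2 * LINE_DOUBLES < n)
            PREFETCH(src + i + 2 * LINE_DOUBLES); /* fetch source two lines ahead */
        ZERO_LINE(dst + i); /* claim the line in cache without reading memory */
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }

    /* Cleanup loop: whatever is left after the last full line. */
    for (; i < n; i++)
        dst[i] = src[i];
}
```

The point of zeroing the destination line first is that every byte of it is
about to be overwritten, so reading its old contents from memory is wasted
bus traffic. The pre-loop matters for correctness as well as speed: dcbz
zeroes the whole block containing the address, so it is only safe on lines
the loop fully owns.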

For each machine, the hacked numbers come first and the original
stream_d.c numbers second.

===============================================================================
===============================================================================
ami: Old IBM box, 80MHz 601, good bandwidth

This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 1500000, Offset = 0
Total memory required = 34.3 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 9999 microseconds.
Each test below will take on the order of 179999 microseconds.
   (= 18 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Copy:            126.3158     0.2011     0.1900     0.2100
Scale:           100.0000     0.2440     0.2400     0.2500
Add:             120.0000     0.3090     0.3000     0.3200
Triad:           116.1290     0.3110     0.3100     0.3200

This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 1500000, Offset = 0
Total memory required = 34.3 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 9999 microseconds.
Each test below will take on the order of 169999 microseconds.
   (= 17 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Copy:             72.7273     0.3320     0.3300     0.3400
Scale:            72.7273     0.3481     0.3300     0.3600
Add:              81.8182     0.4722     0.4400     0.4900
Triad:            80.0000     0.4711     0.4500     0.4800
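
For readers comparing the two tables, the Rate column can be reproduced
from the array size and the Min time column. The sketch below assumes the
usual STREAM byte counting (two arrays touched for Copy and Scale, three
for Add and Triad) and 1 MB = 10^6 bytes, which is what matches the numbers
printed here. The function names stream_rate_mbs and speedup are
hypothetical helpers, not part of stream_d.c.

```c
#include <stddef.h>

/* Rate in MB/s (1 MB = 1e6 bytes) from the best (minimum) time.
   arrays_touched: 2 for Copy and Scale, 3 for Add and Triad. */
double stream_rate_mbs(size_t n, size_t bytes_per_word,
                       int arrays_touched, double min_seconds)
{
    double bytes = (double)arrays_touched * (double)bytes_per_word * (double)n;
    return 1.0e-6 * bytes / min_seconds;
}

/* Ratio of original to hacked best time; > 1 means the hacked loop is faster. */
double speedup(double hacked_min_seconds, double original_min_seconds)
{
    return original_min_seconds / hacked_min_seconds;
}
```

For the ami Copy rows, stream_rate_mbs(1500000, 8, 2, 0.19) gives about
126.3 MB/s and speedup(0.19, 0.33) gives about 1.74. With a 10 ms timer
tick and tests lasting only 17 to 25 ticks, the trailing digits of such
ratios should not be taken too seriously.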

===============================================================================
===============================================================================
gumby: Medium-aged IBM box, 132MHz 604, mediocre bandwidth

-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 1500000, Offset = 0
Total memory required = 34.3 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 9999 microseconds.
Each test below will take on the order of 249999 microseconds.
   (= 25 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Copy:             96.0000     0.2580     0.2500     0.2600
Scale:            96.0000     0.2590     0.2500     0.2600
Add:              94.7368     0.3850     0.3800     0.3900
Triad:            94.7368     0.3870     0.3800     0.3900

-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 1500000, Offset = 0
Total memory required = 34.3 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 9999 microseconds.
Each test below will take on the order of 249999 microseconds.
   (= 25 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Copy:             63.1579     0.3870     0.3800     0.3900
Scale:            63.1579     0.3860     0.3800     0.3900
Add:              70.5882     0.5100     0.5100     0.5100
Triad:            70.5882     0.5170     0.5100     0.5200

===============================================================================
===============================================================================
jet: Newer Motorola box, 200MHz 604e
     This box was somewhat experimental, in that I'm not sure
     if you could ever buy this from Motorola. Issues like
     memory controllers & motherboards have changed a lot
     recently.

-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 1500000, Offset = 0
Total memory required = 34.3 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 9999 microseconds.
Each test below will take on the order of 229999 microseconds.
   (= 23 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Copy:            104.3478     0.2390     0.2300     0.2400
Scale:           100.0000     0.2451     0.2400     0.2500
Add:             100.0000     0.3640     0.3600     0.3700
Triad:           100.0000     0.3680     0.3600     0.3800

-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 1500000, Offset = 0
Total memory required = 34.3 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 9999 microseconds.
Each test below will take on the order of 229999 microseconds.
   (= 23 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time   Min time   Max time
Copy:             68.5714     0.3530     0.3500     0.3600
Scale:            68.5714     0.3530     0.3500     0.3600
Add:              80.0000     0.4661     0.4500     0.4700
Triad:            80.0000     0.4681     0.4500     0.4800

=============================================================

Cliff


