Your guess is correct: the optimizer is removing all those dead blocks
of code! I put each of the loops into a subroutine, turned off double
precision, and then used second() to time the loops. The results now
make sense: 150 Mflops for the sum and scale, 300 Mflops for the saxpy,
and memory rates of 2400 Mbytes/sec for sum and scale and 3600 Mbytes/sec
for saxpy. Note that the 3600 Mbytes/sec figure is a bit below the
theoretical peak of 4000 Mbytes/sec for a saxpy, but I wasn't running
on a dedicated machine. Also, I had to run on a machine with only
64 memory banks, so I may have had some unexpected bank conflicts.
I will send you a "certified" copy of the results once my machine
comes back up...
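
For anyone who wants to repeat the experiment, here is a rough sketch in C
of the approach described above: the loop lives in its own routine, the
result is actually used so the optimizer has no excuse to discard it, and
the rates are derived from the elapsed time. This is not the code I ran.
The names saxpy, mega_rate and time_saxpy are made up for illustration,
clock() stands in for second(), and the byte count assumes 8-byte words
with three memory references (two loads, one store) and two flops per
saxpy iteration. Under those assumptions 300 Mflops is 150 million
iterations per second, which at 24 bytes each gives the 3600 Mbytes/sec
quoted above.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical kernel: kept in its own routine, like the
   one-loop-per-subroutine change described in the message. */
void saxpy(size_t n, double s, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = y[i] + s * x[i];   /* 2 flops, 3 memory references */
}

/* Convert "units per iteration" into millions of units per second.
   Use per_iter = 2.0 for Mflops, or 3 * 8 = 24.0 for Mbytes/sec
   (assumption: 8-byte words, two loads and one store per iteration). */
double mega_rate(double per_iter, double iters, double seconds)
{
    return per_iter * iters / seconds / 1.0e6;
}

/* Run the kernel reps times on vectors of length n and return the
   elapsed CPU time in seconds, or -1.0 if allocation fails.
   y[0] is written to *check so the computed values are consumed
   and the timed loops cannot be removed as dead code. */
double time_saxpy(size_t n, int reps, double *check)
{
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    if (x == NULL || y == NULL) {
        free(x);
        free(y);
        return -1.0;
    }
    for (size_t i = 0; i < n; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }

    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        saxpy(n, 0.5, x, y);
    clock_t t1 = clock();

    *check = y[0];
    free(x);
    free(y);
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

To use it, call time_saxpy(n, reps, &check) from a main program with n
large enough that the elapsed time is well above the clock resolution,
then pass 2.0 and 24.0 to mega_rate with iters = n * reps. Printing
check is what keeps the work live; a C compiler may still inline the
routine, so the separate function alone is not a guarantee.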
P.S. If you would like to see the results for a dedicated 8-processor
machine, let me know. I can probably arrange the time.