Dear Prof. McCalpin,
I have some data points for your STREAM benchmark. I ran the Fortran and C
versions on a DEC Alpha 3000/900 ( at Univ. of Wisconsin --- Milwaukee ) and got
essentially the same performance from both. I also ran the Fortran version at the
Univ. of Illinois's supercomputer center ( NCSA ) on a Convex 3880 ( 8 cpu ),
a Convex SPP-120 ( 8 cpu ), and an SGI Power Challenge ( 16 cpu, 90 MHz ). There are
(1) On the Convex 3880, I used fc -O2, which vectorizes but does not parallelize
the loops, and got the performance for 1 cpu. Then I used fc -O3 -ep n ( n=2...8 ):
-O3 turns on parallelization ( in addition to vectorization ) and -ep sets the
expected number of processors. But the performance numbers are lower than the
1 cpu ( -O2 ) ones. I suspected that I didn't really get 8 cpus ( e.g. with -ep 8 )
and that the system only gave me 1 cpu because it was busy. I called NCSA, and the
consultant said that was not the case: I got all the cpus I asked for. Now I am
really confused. Maybe the time has been 'normalized' to 1 cpu ? I tried
/bin/time a.out, which shows real 16 sec., user 6 sec. ( it seems I only got
1 cpu, unless the time is normalized ). I also tried using your wallclock.c
instead of etime(); the results are very inconsistent, as you can see.
(2) On the Convex SPP-120, I again used fc -O2 to get results for 1 cpu,
then fc -O3, with which, according to the NCSA consultant, I got all 8 cpus
( there is no way to specify how many cpus I want or can get other than
-O2/-O3 ). Again, I got lower performance numbers for -O3 than for -O2.
(3) On the SGI Power Challenge ( 16 cpu ), I setenv MP_NUM_THREADS = 2,4,.. 16.
The timing clearly shows I got multiple cpus; however, the performance
numbers are very inconsistent, especially at large numbers of cpus. I made
several additional tests for 8 cpus and up, and I see very large variation.
All of the above timings were done with etime(), unless otherwise stated ( on
the Convex 3880 I used both etime() and wallclock() ).
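
To make the difference between the two timers concrete, here is a minimal C sketch. It assumes a POSIX system; the function names are made up for this illustration, and it is not the STREAM source or wallclock.c itself. etime() reports CPU time charged to the process, and on many systems that time accumulates over all threads of a parallel run, whereas a wall-clock timer reports elapsed real time. A rate computed from CPU time can therefore look lower on several cpus even when the run finishes sooner. Whether the Convex compilers behave this way is an assumption, not something verified here.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* Elapsed real time in seconds (what a wall-clock timer measures). */
double wall_seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}

/* CPU time charged to this process, in seconds. Similar in spirit to
   etime(); on many systems it is summed over all threads (assumption). */
double cpu_seconds(void) {
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* MB/s for a copy kernel: n doubles read plus n doubles written. */
double copy_mb_per_s(size_t n, double seconds) {
    assert(seconds > 0.0);
    return 2.0 * (double)sizeof(double) * (double)n / seconds / 1.0e6;
}

/* Time one copy pass a[i] = c[i] with both clocks. */
void time_copy(double *a, const double *c, size_t n,
               double *wall, double *cpu) {
    double w0 = wall_seconds();
    double c0 = cpu_seconds();
    for (size_t i = 0; i < n; i++)
        a[i] = c[i];
    *wall = wall_seconds() - w0;
    *cpu  = cpu_seconds() - c0;
}
```

Calling time_copy() on arrays much larger than the cache and passing each of the two times to copy_mb_per_s() gives two rates. If the CPU-time rate drops as cpus are added while the wall-clock rate rises, the timer rather than the machine is the likely cause.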
I don't think the performance numbers for multiple cpus are useful, because
how the time is counted is unknown in the Convex case, and it varies so wildly
in the SGI case ( at least for more than 8 cpus ). The single-cpu performance
numbers, I think, are reasonably reliable. What else can I do ? A different
compiler flag ? A different timing routine ?
Anyway, I like the STREAM benchmark; I wish it could be part of SPECfp 95,
weighted as 1/3 of it. I also have a benchmark/database of my own, a matrix
multiplication ( Fortran ), using axpy, dot product, block/unroll ( like KAP )
with different blocking and unroll sizes, a BLAS dgemm call if it is available,
and a few other variations, such as unrolled axpy and transposed dot product.
The dimension of the matrix is varied over 100, 128, 256, 400, 512, 768,
1000/1024, and in some cases 1536, 2048, 3072. The small dimensions test
cpu/cache, while the large dimensions of axpy and dot product test memory
bandwidth. The dot product in addition tests stride != 1 ( cache miss/memory
bank conflict ). The axpy option ( large dimension ) is similar to the STREAM
benchmark. I mainly run it on RISC workstations and vector computers, but not
much on multi-cpu machines. I am a research staff ( physics ) doing surface electron
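
For readers unfamiliar with the two loop orderings named above, the following is a rough C sketch of the dot-product and axpy forms of a matrix multiply. The benchmark described in this message is Fortran and is not reproduced here; the function names and the row-major layout are assumptions made for this illustration. In row-major C the dot-product form walks B with stride n, while the axpy form touches memory with unit stride only. Fortran stores arrays column-major, so there the strided access falls on the other operand.

```c
#include <stddef.h>

/* C = A*B, n x n, row-major. Dot-product form: each c[i][j] is the dot
   product of row i of A and column j of B; B is read with stride n. */
void matmul_dot(size_t n, const double *a, const double *b, double *c) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* C = A*B, n x n, row-major. Axpy form: row i of C accumulates
   a[i][k] times row k of B; the inner loop is unit stride. */
void matmul_axpy(size_t n, const double *a, const double *b, double *c) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++)
            c[i * n + j] = 0.0;
        for (size_t k = 0; k < n; k++) {
            double alpha = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += alpha * b[k * n + j];
        }
    }
}
```

The inner loop of matmul_axpy() is the same y = y + alpha*x update that a memory-bandwidth kernel exercises once n is large enough that the rows no longer fit in cache.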
Dr. Hong Huang
at Univ. of Wisconsin---Milwaukee