Re: Sun Ultra HPC Memory Performance

From: Alan Charlesworth (alanc@West.Sun.COM)
Date: Thu Mar 13 1997 - 22:44:36 CST


(A copy of this message has also been posted to the following newsgroups:
comp.sys.super, comp.arch, comp.benchmarks, comp.sys.sun.misc)

In article <5g9mqi$kif@murrow.corp.sgi.com>, mccalpin@asd.sgi.com wrote:

>
> I have some numbers for the UE 10000 (64-cpu only), but my
> understanding is that they were preliminary, so I was waiting
> for the rest of the numbers before putting them in the table.
>
> I guess I should follow up on this and find out if I misunderstood
> the intent of the message I received from Sun. I was certainly
> hoping to get some numbers from smaller processor counts on the
> UE10000 as well.
>
> Since the numbers were posted to USENET, I will repeat them here:
>
> omitted
> --
> John D. McCalpin, Ph.D. Supercomputing Performance Analyst
> Scalable Systems Group http://reality.sgi.com/employees/mccalpin
> Silicon Graphics, Inc. mccalpin@sgi.com 415-933-7407

Sorry, John, for not getting these out to the public sooner. Here are the
Starfire Stream results that I ran at the end of January.

1. Auto-parallel C Stream bandwidth
          Copy   Scale    Vadd   Triad
 Cpus     MBps    MBps    MBps    MBps
    1      164     164     202     202
    8    1,271   1,270   1,544   1,546
   16    2,371   2,414   2,942   2,905
   24    3,568   3,577   4,292   4,305
   32    4,397   4,408   5,166   5,188
   40    5,317   5,374   6,162   6,222
   48    5,961   6,056   6,861   6,914
   56    6,183   6,304   7,131   7,128
   63    6,307   6,391   7,203   7,197

2. Auto-parallel C total interconnect bandwidth

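These are the Table 1 numbers, multiplied by 3/2 for copy
and scale, and by 4/3 for vadd and triad, to account for
write-allocate traffic on the interconnect: each stored
cache block is first read from memory, so copy and scale
move three blocks for every two that Stream counts, and
vadd and triad move four for every three. They are useful
for comparison against the peak bandwidth of 10,667 MBps.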

          Copy   Scale    Vadd   Triad
 Cpus     MBps    MBps    MBps    MBps
    1      246     246     269     269
    8    1,907   1,905   2,059   2,062
   16    3,557   3,620   3,922   3,873
   24    5,353   5,366   5,722   5,740
   32    6,595   6,612   6,888   6,917
   40    7,976   8,062   8,215   8,296
   48    8,942   9,083   9,148   9,219
   56    9,274   9,456   9,508   9,505
   63    9,461   9,586   9,604   9,596
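
The scaling used for Table 2 can be sketched as follows. This is only an illustration, not the script used to produce the table: the function names and the per-kernel read/write counts are my own assumptions, based on the standard Stream kernel definitions, and only the one-CPU row of Table 1 is used as input.

```python
from fractions import Fraction

# (arrays read, arrays written) per loop iteration for each Stream kernel.
# These counts are assumed from the standard Stream kernel definitions.
KERNELS = {
    "copy":  (1, 1),   # c[i] = a[i]
    "scale": (1, 1),   # b[i] = s * c[i]
    "vadd":  (2, 1),   # c[i] = a[i] + b[i]
    "triad": (2, 1),   # a[i] = b[i] + s * c[i]
}

def write_allocate_factor(reads, writes):
    """Interconnect traffic divided by the traffic Stream counts,
    assuming every written cache block is first read (write-allocate)."""
    counted = reads + writes        # what Stream credits
    actual = reads + 2 * writes     # each write also costs one read
    return Fraction(actual, counted)

def interconnect_mbps(kernel, stream_mbps):
    """Scale a reported Stream figure up to total interconnect traffic."""
    reads, writes = KERNELS[kernel]
    return float(stream_mbps * write_allocate_factor(reads, writes))

# One-CPU row of Table 1
table1_one_cpu = {"copy": 164, "scale": 164, "vadd": 202, "triad": 202}
for kernel, mbps in table1_one_cpu.items():
    print(kernel, round(interconnect_mbps(kernel, mbps)))
```

The factors come out as 3/2 for copy and scale and 4/3 for vadd and triad, matching the multipliers stated above.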

3. VIS assembler "experimental" Stream bandwidth

The SPARC Visual Instruction Set (VIS) includes
block load and store instructions which move data between a
64-byte aligned block of memory and eight floating-point
registers. Because an entire cache block is accessed, no
extra write-allocate traffic is necessary on the interconnect.

Compared with Table 2, my VIS assembler loops
keep a bit more total interconnect traffic
outstanding than the stock C code did.

          Copy   Scale    Vadd   Triad
 Cpus     MBps    MBps    MBps    MBps
    1      325     322     288     263
    8    2,499   2,491   2,252   2,099
   16    4,527   4,669   4,243   3,944
   24    6,720   6,759   6,156   5,860
   32    7,872   7,987   7,377   7,092
   40    9,277   9,355   8,877   8,594
   48    9,938   9,917   9,618   9,373
   56   10,250  10,175  10,030   9,910
   63   10,307  10,180  10,181  10,107
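
For comparison against the 10,667 MBps peak quoted earlier, the 63-CPU rows of Tables 2 and 3 can be expressed as a fraction of peak. This is a small sketch with names of my own choosing; it only divides the posted numbers by the posted peak.

```python
PEAK_MBPS = 10_667  # peak interconnect bandwidth quoted in the post

def fraction_of_peak(mbps, peak=PEAK_MBPS):
    """Return measured interconnect bandwidth as a fraction of peak."""
    return mbps / peak

# 63-CPU rows: Table 2 (auto-parallel C, total interconnect traffic)
# and Table 3 (VIS assembler, no write-allocate traffic)
c_63 = {"copy": 9461, "scale": 9586, "vadd": 9604, "triad": 9596}
vis_63 = {"copy": 10307, "scale": 10180, "vadd": 10181, "triad": 10107}

for kernel in c_63:
    print(f"{kernel:6s} C: {fraction_of_peak(c_63[kernel]):.1%}  "
          f"VIS: {fraction_of_peak(vis_63[kernel]):.1%}")
```

On these figures the C loops reach roughly 89 to 90 percent of peak at 63 CPUs, and the VIS loops roughly 95 to 97 percent.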


