Trying to understand memory throughput numbers from architecture
				  R. Krivec
				  01/05/08

        Abstract. An attempt to explain the STREAM numbers from the
	network part of machines (buses and crossbars) is found to
	fail; instead, the memory throughput benchmark is found to be
	proportional to processor MHz (!). It could be a coincidence
	because the role of cache coherency is unclear.


The architectural calculations attempt to reproduce the Veberic numbers
(DV), which correlate with 1/2 of the STREAM COPY values. (The Veberic
numbers are approximate values read from the graphs at
http://www-f1.ijs.si/~krivec/bench/dv_mem_a.ps, in the region where memory
chunks are larger than the L2 cache.) STREAM COPY results are for 1
processor or, where this datum is not available, approximately extrapolated
to 1 processor. All values in the table are in MBps.
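The stated correlation between DV and 1/2 of STREAM COPY can be checked
with a few lines of Python. This is only a sketch: the numbers are copied
from the table for the three machines that have a DV entry, and the names
MEASURED, dv_over_half_copy and RATIOS are made up for this illustration.

```python
# (DV, STREAM COPY at 1 processor), both in MBps, copied from the table.
# Only machines with a DV entry are listed.
MEASURED = {
    "GS160/731": (400, 970),
    "ES40/833": (600, 1342),
    "O200/225": (150, 303),
}


def dv_over_half_copy(dv, copy):
    """Ratio of the DV value to one half of the STREAM COPY value."""
    return dv / (copy / 2.0)


RATIOS = {
    name: dv_over_half_copy(dv, copy) for name, (dv, copy) in MEASURED.items()
}

for name, ratio in RATIOS.items():
    print(f"{name:10s}  DV / (COPY/2) = {ratio:.2f}")
```

For these three machines the ratio stays between about 0.8 and 1.0, so DV
is somewhat below 1/2 of STREAM COPY in every case.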


------------------------------------------------------------------------------
MACHINE                                     DV      STREAM    STREAM-1 P./NODE
Attempted explanation                                          (no contention)
------------------------------------------------------------------------------
                                        (1 p.)  No. p.  Value    No. p.  Value
------------------------------------------------------------------------------
GS160/731: aggregate BW = 6.4 GBps                16     9796
(bidirectional?, 4 processors);                    4     2455
1 processor 6.4 GBps/4/2 = 800 MBps        400     1      970
unidirectional;
cache coherency --> 400 MBps

ES40/833: aggregate BW = 5.2 GBps                  4     2504
(bidirectional?, 4 processors);                    2     1762
1 processor 5.2 GBps/4/2 = 650 MBps        600     1     1342
unidirectional;
(NO cache coherency overhead?)

O200/225: 2 processors share a 0.720 GBps          4      632
(peak) bus; crossbar;                              2      330
1 processor 0.72 GBps/1/2 = 360 MBps       150     1      303
(total bus, unidirectional);                      
cache coherency --> 180 MBps (?)

O2000/250: 2 processors share a 0.780 GBps        16     2910     16     5560
(peak) bus; crossbar;                              8     1430      8     2570
1 processor 0.78 GBps/1/2 = 390 MBps               4      747      4     1280
(total bus, unidirectional);                       2      361      2      664
cache coherency --> 195 MBps (?)                   1      332

O3800/400: 2 processors share a 1.6 GBps bus;     16     5534
crossbar;
1 processor 1.6 GBps/1/2 = 800 MBps                8     2855
(total bus, unidirectional);                       4     1400
cache coherency --> 400 MBps (?)                   2 *1   700
                                                   1 *1   600?
						   
SGI PowerChallenge/10k                             4      537
(bus)                                              2      351
                                                   1      172

Sun_UE_6001                                        4      921
(bus)                                              3      694
                                                   2      460
                                                   1      281

HP_V2600_Enterprise                                8     1539
(crossbar?)                                        1      390

HP_N4000 (bus; 440 MHz?)                           8     1759
                                                   1      760
------------------------------------------------------------------------------
----------------------------
*1  Extrapolated
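
The "attempted explanation" arithmetic in the left column of the table can
be written out as a small Python sketch. The divisors and the factor 1/2
for cache coherency are exactly those used in the table; the function name
per_processor_estimate and its parameters are hypothetical and only serve
the illustration. DV is None where the table gives no value.

```python
def per_processor_estimate(aggregate_gbps, divisor, coherency_halving):
    """Estimated unidirectional bandwidth of one processor, in MBps.

    aggregate_gbps     quoted aggregate (bidirectional?) bandwidth in GBps
    divisor            processor divisor used in the table (4 or 1)
    coherency_halving  True if the table halves the result once more
    """
    mbps = aggregate_gbps * 1000.0 / divisor / 2.0  # /2: unidirectional
    if coherency_halving:
        mbps /= 2.0
    return mbps


# name: (aggregate GBps, divisor, coherency halving, DV in MBps or None)
MACHINES = {
    "GS160/731": (6.4, 4, True, 400),
    "ES40/833": (5.2, 4, False, 600),
    "O200/225": (0.72, 1, True, 150),
    "O2000/250": (0.78, 1, True, None),
    "O3800/400": (1.6, 1, True, None),
}

ESTIMATES = {
    name: per_processor_estimate(bw, div, cc)
    for name, (bw, div, cc, _dv) in MACHINES.items()
}

for name, (_bw, _div, _cc, dv) in MACHINES.items():
    dv_text = "n/a" if dv is None else str(dv)
    print(f"{name:10s}  estimate {ESTIMATES[name]:6.0f} MBps   DV {dv_text}")
```

The sketch only reproduces the numbers already in the table; it says
nothing about whether the halving for cache coherency is justified.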


Remarks:

1. About STREAM (http://www.cs.virginia.edu/stream/)

  ES40: fast on 1 proc., scales badly, reaches 1/2 aggregate QBB BW at 4 proc.

  GS160: slower than ES40 on 1 proc., scales better, reaches 1/2.5 aggregate
  QBB BW at 4 proc. (still less than ES40).

  STREAM for Origins is approximately proportional to processor MHz. This
  explains why white papers from 1997 list the same network speeds.

2. About attempted explanations:

  Explanations should explain the DV (unidirectional) values. They seem to
  work for ES40 and GS160 and also somewhat for Origin 3800, probably because
  the aggregate BW can be utilized at 400 MHz but not at lower MHz. (Older
  Origins are slower because of their lower MHz.)

  The only sensible comparison is between GS160 and Origin 3800. Here we
  have proportionality with MHz again!
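
The proportionality with MHz can be made visible by dividing the
1-processor STREAM COPY value by the clock. In the Python sketch below the
clock is taken from the number after the slash in the machine label, which
is an assumption (the text confirms it only for Origin 3800 at 400 MHz).
The names SINGLE_CPU and PER_MHZ are invented for this illustration.

```python
# name: (clock in MHz, assumed from the machine label;
#        STREAM COPY at 1 processor in MBps, from the table)
SINGLE_CPU = {
    "O200/225": (225, 303),
    "O2000/250": (250, 332),
    "O3800/400": (400, 600),  # extrapolated, marked "600?" in the table
    "GS160/731": (731, 970),
    "ES40/833": (833, 1342),
}

# STREAM COPY per MHz; a constant value would mean exact proportionality.
PER_MHZ = {name: copy / mhz for name, (mhz, copy) in SINGLE_CPU.items()}

for name, value in PER_MHZ.items():
    print(f"{name:10s}  {value:.2f} MBps per MHz")
```

With these inputs the ratio stays between roughly 1.3 and 1.6 MBps per MHz
over a clock range from 225 to 833 MHz.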


CONCLUSION

  In the above class of machines, STREAM is just roughly proportional to
  MHz! In addition, ALL machines suffer from contention when going from 1 to
  2 processors, so 4 processors only get approximately double the speed of 1
  processor. However, the bus machines (Power Challenge and UE) do scale
  linearly with number of processors. (There are no data for newer Sun
  machines.) The values are lower than expected, though.
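
  The scaling statements can be checked against the STREAM column of the
  table with the following Python sketch. Only measured (not extrapolated)
  rows that have a 1-processor value are included; the names STREAM_COPY,
  speedup and SPEEDUP_4 are made up for this illustration.

```python
# STREAM COPY values in MBps by number of processors, copied from the table.
STREAM_COPY = {
    "GS160/731": {1: 970, 4: 2455, 16: 9796},
    "ES40/833": {1: 1342, 2: 1762, 4: 2504},
    "O200/225": {1: 303, 2: 330, 4: 632},
    "O2000/250": {1: 332, 2: 361, 4: 747, 8: 1430, 16: 2910},
    "PowerChallenge/10k": {1: 172, 2: 351, 4: 537},
    "Sun_UE_6001": {1: 281, 2: 460, 3: 694, 4: 921},
}


def speedup(values, n):
    """Throughput on n processors relative to 1 processor."""
    return values[n] / values[1]


SPEEDUP_4 = {name: speedup(v, 4) for name, v in STREAM_COPY.items()}

for name, s in SPEEDUP_4.items():
    print(f"{name:20s}  4 proc. / 1 proc. = {s:.2f}")
```

  With these inputs the 4-processor speedup is about 1.9 to 2.5 for ES40,
  GS160 and the Origins, and a little above 3 for the two bus machines.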

  On the interesting set of machines this benchmark is linearly dependent
  on MHz (or on the Linpack benchmark) and is therefore unusable.